A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is less desirable, and possibly revenue-diminishing, for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing a high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    plot_confusion_matrix,  # removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay
    make_scorer,
)
# To build linear model for statistical analysis and prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# Additional metrics (ROC curve and AUC)
from sklearn import metrics
from sklearn.metrics import roc_curve, roc_auc_score
hotels = pd.read_csv("INNHotelsGroup.csv")
data = hotels.copy()
data.head()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
data.tail()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
| 36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67 | 0 | Not_Canceled |
data.shape
(36275, 19)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
Every column has 36275 non-null entries, so there are no missing values in this dataset.
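The non-null counts from `info()` can also be checked directly. The sketch below uses a small hypothetical frame (`toy` is an assumption, not part of the dataset) to count missing values per column and duplicated rows; the same two calls would apply to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame(
    {
        "lead_time": [224, 5, np.nan, 5],
        "market_segment_type": ["Offline", "Online", "Online", "Online"],
    }
)

missing_per_column = toy.isnull().sum()  # number of NaN values in each column
duplicate_rows = int(toy.duplicated().sum())  # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

On the real data, `data.isnull().sum()` should return zero for every column if the `info()` output above is right.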
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.0 | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.00 | 2.0 | 4.0 |
| no_of_children | 36275.0 | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.00 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.00 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | 2.204300 | 1.410905 | 0.0 | 1.0 | 2.00 | 3.0 | 17.0 |
| required_car_parking_space | 36275.0 | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| lead_time | 36275.0 | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.00 | 126.0 | 443.0 |
| arrival_year | 36275.0 | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.00 | 2018.0 | 2018.0 |
| arrival_month | 36275.0 | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.00 | 10.0 | 12.0 |
| arrival_date | 36275.0 | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.00 | 23.0 | 31.0 |
| repeated_guest | 36275.0 | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| no_of_previous_cancellations | 36275.0 | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.00 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 36275.0 | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.00 | 0.0 | 58.0 |
| avg_price_per_room | 36275.0 | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
| no_of_special_requests | 36275.0 | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.00 | 1.0 | 5.0 |
data.describe(include=["object", "bool"])
| | Booking_ID | type_of_meal_plan | room_type_reserved | market_segment_type | booking_status |
|---|---|---|---|---|---|
| count | 36275 | 36275 | 36275 | 36275 | 36275 |
| unique | 36275 | 4 | 7 | 5 | 2 |
| top | INN01418 | Meal Plan 1 | Room_Type 1 | Online | Not_Canceled |
| freq | 1 | 27835 | 28130 | 23214 | 24390 |
cat_columns = ["type_of_meal_plan", "room_type_reserved", "market_segment_type", "booking_status"]
for i in cat_columns:
    print(data[i].value_counts())
    print("*" * 50)
Meal Plan 1     27835
Not Selected     5130
Meal Plan 2      3305
Meal Plan 3         5
Name: type_of_meal_plan, dtype: int64
**************************************************
Room_Type 1    28130
Room_Type 4     6057
Room_Type 6      966
Room_Type 2      692
Room_Type 5      265
Room_Type 7      158
Room_Type 3        7
Name: room_type_reserved, dtype: int64
**************************************************
Online           23214
Offline          10528
Corporate         2017
Complementary      391
Aviation           125
Name: market_segment_type, dtype: int64
**************************************************
Not_Canceled    24390
Canceled        11885
Name: booking_status, dtype: int64
**************************************************
data.repeated_guest.unique()
array([0, 1])
Leading Questions:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
histogram_boxplot(data, "no_of_adults")
histogram_boxplot(data, "no_of_children")
histogram_boxplot(data, "no_of_weekend_nights")
histogram_boxplot(data, "no_of_week_nights")
histogram_boxplot(data, "required_car_parking_space")
histogram_boxplot(data, "lead_time")
histogram_boxplot(data, "arrival_year")
histogram_boxplot(data, "arrival_month")
histogram_boxplot(data, "arrival_date")
histogram_boxplot(data, "repeated_guest")
histogram_boxplot(data, "no_of_previous_cancellations")
histogram_boxplot(data, "no_of_previous_bookings_not_canceled")
histogram_boxplot(data, "avg_price_per_room")
histogram_boxplot(data, "no_of_special_requests")
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage

    plt.show()  # show the plot
labeled_barplot(data, "type_of_meal_plan", perc=True)
labeled_barplot(data, "room_type_reserved", perc=True)
labeled_barplot(data, "market_segment_type", perc=True)
labeled_barplot(data, "booking_status", perc=True)
labeled_barplot(data, "no_of_adults", perc=True)
labeled_barplot(data, "no_of_children", perc=True)
labeled_barplot(data, "no_of_weekend_nights", perc=True)
labeled_barplot(data, "no_of_week_nights", perc=True, n=12)
labeled_barplot(data, "arrival_year", perc=True)
labeled_barplot(data, "arrival_month", perc=True)
labeled_barplot(data, "no_of_special_requests", perc=True)
data_repeated = data[data.repeated_guest==1]
#data_repeated.head()
labeled_barplot(data_repeated, "booking_status", perc=True)
# Correlation is only defined for numeric columns, so the categorical (object) columns are left out of the heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(
    data.select_dtypes(include="number").corr(), annot=True, vmin=-1, vmax=1, cmap="Spectral"
)
plt.show()
sns.pairplot(data=data, hue='booking_status')
plt.show()
cols = [
    "no_of_weekend_nights",
    "lead_time",
    "avg_price_per_room",
    "no_of_special_requests",
]
plt.figure(figsize=(12, 7))

for i, variable in enumerate(cols):
    plt.subplot(2, 2, i + 1)
    # keyword arguments: newer seaborn versions no longer accept x and y positionally
    sns.boxplot(data=data, x="booking_status", y=variable, palette="PuBu")
    plt.title(variable)

plt.tight_layout()
plt.show()
# Same boxplots with outliers hidden (showfliers=False)
cols = [
    "no_of_weekend_nights",
    "lead_time",
    "avg_price_per_room",
    "no_of_special_requests",
]
plt.figure(figsize=(12, 7))

for i, variable in enumerate(cols):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(data=data, x="booking_status", y=variable, palette="PuBu", showfliers=False)
    plt.title(variable)

plt.tight_layout()
plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # legend placed outside the axes
    plt.show()
stacked_barplot(data, "no_of_special_requests", "booking_status")
booking_status          Canceled  Not_Canceled    All
no_of_special_requests
All                        11885         24390  36275
0                           8545         11232  19777
1                           2703          8670  11373
2                            637          3727   4364
3                              0           675    675
4                              0            78     78
5                              0             8      8
------------------------------------------------------------------------------------------------------------------------
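The counts above are easier to compare as cancellation rates. The following sketch copies the printed counts into a small frame (the names `counts` and `cancel_rate` are illustrative) and divides the canceled bookings by the row totals.

```python
import pandas as pd

# Counts copied from the crosstab printed above
counts = pd.DataFrame(
    {
        "Canceled": [8545, 2703, 637, 0, 0, 0],
        "Not_Canceled": [11232, 8670, 3727, 675, 78, 8],
    },
    index=pd.Index([0, 1, 2, 3, 4, 5], name="no_of_special_requests"),
)

# Share of bookings canceled within each level of special requests
cancel_rate = counts["Canceled"] / counts.sum(axis=1)
print((100 * cancel_rate).round(1))
```

The rate falls from about 43% for bookings with no special requests to about 15% with two, and no booking with three or more special requests was canceled in this data.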
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="market_segment_type", y="avg_price_per_room", showfliers=False)
plt.show()
EDA Insights
Most bookings in the data are for parties of two adults with no children who stay zero to two weekend nights and one to four week nights, do not require a car parking space, book with a lead time of less than 200 days, arrived in 2018, are not repeat guests, have no previous cancellations or previous bookings with the hotel, pay an average room price of roughly $50 to $150, and make zero or one special request.
data = data.drop("Booking_ID", axis=1)
# one-hot encoding with get_dummies
# creating dummy variables
dummy_data = pd.get_dummies(
    data,
    columns=[
        "type_of_meal_plan",
        "room_type_reserved",
        "market_segment_type",
        "booking_status",
    ],
    drop_first=True,
)
dummy_data.head()
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online | booking_status_Not_Canceled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 224 | 2017 | 10 | 2 | 0 | 0 | 0 | 65.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 2 | 0 | 2 | 3 | 0 | 5 | 2018 | 11 | 6 | 0 | 0 | 0 | 106.68 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1 | 0 | 2 | 1 | 0 | 1 | 2018 | 2 | 28 | 0 | 0 | 0 | 60.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 2 | 0 | 0 | 2 | 0 | 211 | 2018 | 5 | 20 | 0 | 0 | 0 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 2 | 0 | 1 | 1 | 0 | 48 | 2018 | 4 | 11 | 0 | 0 | 0 | 94.50 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
# Let's look at the distribution of target variable
data["booking_status"].value_counts()
Not_Canceled    24390
Canceled        11885
Name: booking_status, dtype: int64
data["booking_status"].value_counts(normalize=True)
Not_Canceled    0.672364
Canceled        0.327636
Name: booking_status, dtype: float64
X = dummy_data.drop("booking_status_Not_Canceled", axis=1) # Features
y = dummy_data["booking_status_Not_Canceled"].astype("int64") # Labels (Target Variable)
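Because `drop_first=True` drops the alphabetically first level (`Canceled`), the target kept here is `booking_status_Not_Canceled`, so class 1 means the booking was not canceled. If cancellation is meant to be the positive class, an explicit mapping is one option. The sketch below compares both encodings on a hypothetical sample (`status`, `y_cancel`, and `y_dummy` are illustrative names).

```python
import pandas as pd

# Hypothetical sample of the booking_status column
status = pd.Series(["Not_Canceled", "Canceled", "Canceled", "Not_Canceled"])

# Explicit mapping: 1 = canceled (the event of interest), 0 = not canceled
y_cancel = status.map({"Canceled": 1, "Not_Canceled": 0})

# For comparison: the column that get_dummies(drop_first=True) keeps
y_dummy = pd.get_dummies(status, drop_first=True)["Not_Canceled"].astype(int)

print(pd.DataFrame({"status": status, "y_cancel": y_cancel, "y_dummy": y_dummy}))
```

The two encodings are mirror images of each other, so with the dummy encoding used in this notebook, coefficient signs and metrics such as recall refer to "not canceled" unless class 0 is selected explicitly.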
X
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 224 | 2017 | 10 | 2 | 0 | 0 | 0 | 65.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 0 | 2 | 3 | 0 | 5 | 2018 | 11 | 6 | 0 | 0 | 0 | 106.68 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 2 | 1 | 0 | 1 | 2018 | 2 | 28 | 0 | 0 | 0 | 60.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 2 | 0 | 0 | 2 | 0 | 211 | 2018 | 5 | 20 | 0 | 0 | 0 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 2 | 0 | 1 | 1 | 0 | 48 | 2018 | 4 | 11 | 0 | 0 | 0 | 94.50 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36270 | 3 | 0 | 2 | 6 | 0 | 85 | 2018 | 8 | 3 | 0 | 0 | 0 | 167.80 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 36271 | 2 | 0 | 1 | 3 | 0 | 228 | 2018 | 10 | 17 | 0 | 0 | 0 | 90.95 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 36272 | 2 | 0 | 2 | 6 | 0 | 148 | 2018 | 7 | 1 | 0 | 0 | 0 | 98.39 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 36273 | 2 | 0 | 0 | 3 | 0 | 63 | 2018 | 4 | 21 | 0 | 0 | 0 | 94.50 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 36274 | 2 | 0 | 1 | 2 | 0 | 207 | 2018 | 12 | 30 | 0 | 0 | 0 | 161.67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
36275 rows × 27 columns
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)
(25392, 27) (10883, 27)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 25392
Number of rows in test data = 10883
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Percentage of classes in training set:
1    0.670644
0    0.329356
Name: booking_status_Not_Canceled, dtype: float64
Percentage of classes in test set:
1    0.676376
0    0.323624
Name: booking_status_Not_Canceled, dtype: float64
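The classes are moderately imbalanced (about 67% not canceled versus 33% canceled), and the training and test sets have nearly the same class proportions.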
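The split above is not stratified, so the class shares differ slightly between the two sets (about 67.1% versus 67.6% for class 1). If identical proportions are wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch on hypothetical labels (`X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the same 67/33 split as booking_status
rng = np.random.default_rng(1)
y_toy = np.array([1] * 670 + [0] * 330)
X_toy = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

# With stratify, both subsets keep the original class proportions
print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```

With a difference as small as the one observed here, stratifying is a refinement rather than a necessity.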
# fitting the model on training set
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.426215
Iterations: 35
/Users/sydneystolle/opt/anaconda3/lib/python3.8/site-packages/statsmodels/base/model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
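The convergence warning, together with the very large standard errors reported in the summary for `type_of_meal_plan_Meal Plan 3` and `market_segment_type_Complementary`, is consistent with (quasi-)complete separation, where some category level contains only one outcome. This is a possible explanation, not a confirmed one. One way to look for it, sketched on a hypothetical toy frame, is to tabulate the target within each level and flag levels where one outcome never occurs.

```python
import pandas as pd

# Hypothetical toy data; the real check would use columns such as data["market_segment_type"]
toy = pd.DataFrame(
    {
        "segment": ["Online", "Online", "Offline", "Offline", "Complementary", "Complementary"],
        "booking_status": [
            "Canceled", "Not_Canceled", "Canceled", "Not_Canceled", "Not_Canceled", "Not_Canceled",
        ],
    }
)

# Rows: category levels, columns: outcome counts
table = pd.crosstab(toy["segment"], toy["booking_status"])

# Levels where one of the outcomes never occurs
one_sided = table.index[(table == 0).any(axis=1)].tolist()
print(table)
print("Levels with a single outcome:", one_sided)
```

Levels flagged this way are candidates for merging with another level or dropping before refitting; whether that applies here would need to be confirmed on the full data.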
# let's print the logistic regression summary
print(lg.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25365
Method: MLE Df Model: 26
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.3274
Time: 23:29:00 Log-Likelihood: -10822.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_adults -0.1009 0.038 -2.685 0.007 -0.175 -0.027
no_of_children -0.1629 0.057 -2.867 0.004 -0.274 -0.052
no_of_weekend_nights -0.1082 0.020 -5.478 0.000 -0.147 -0.069
no_of_week_nights -0.0405 0.012 -3.296 0.001 -0.065 -0.016
required_car_parking_space 1.6213 0.138 11.781 0.000 1.352 1.891
lead_time -0.0163 0.000 -63.030 0.000 -0.017 -0.016
arrival_year 0.0012 0.000 9.205 0.000 0.001 0.001
arrival_month 0.0615 0.006 10.261 0.000 0.050 0.073
arrival_date -0.0005 0.002 -0.234 0.815 -0.004 0.003
repeated_guest 2.3698 0.609 3.892 0.000 1.176 3.563
no_of_previous_cancellations -0.2667 0.086 -3.107 0.002 -0.435 -0.098
no_of_previous_bookings_not_canceled 0.1626 0.147 1.109 0.267 -0.125 0.450
avg_price_per_room -0.0201 0.001 -28.012 0.000 -0.022 -0.019
no_of_special_requests 1.4550 0.030 48.644 0.000 1.396 1.514
type_of_meal_plan_Meal Plan 2 -0.0464 0.064 -0.724 0.469 -0.172 0.079
type_of_meal_plan_Meal Plan 3 -19.2937 1.23e+04 -0.002 0.999 -2.42e+04 2.42e+04
type_of_meal_plan_Not Selected -0.3442 0.052 -6.571 0.000 -0.447 -0.242
room_type_reserved_Room_Type 2 0.3753 0.130 2.883 0.004 0.120 0.630
room_type_reserved_Room_Type 3 -0.0511 1.298 -0.039 0.969 -2.596 2.494
room_type_reserved_Room_Type 4 0.2599 0.053 4.903 0.000 0.156 0.364
room_type_reserved_Room_Type 5 0.6820 0.210 3.253 0.001 0.271 1.093
room_type_reserved_Room_Type 6 1.0239 0.146 7.010 0.000 0.738 1.310
room_type_reserved_Room_Type 7 1.4602 0.294 4.969 0.000 0.884 2.036
market_segment_type_Complementary 25.4264 1.23e+04 0.002 0.998 -2.42e+04 2.42e+04
market_segment_type_Corporate 1.3119 0.266 4.927 0.000 0.790 1.834
market_segment_type_Offline 2.3307 0.255 9.149 0.000 1.831 2.830
market_segment_type_Online 0.4995 0.252 1.985 0.047 0.006 0.993
========================================================================================================
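The coefficients are on the log-odds scale for `booking_status_Not_Canceled`, so exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for a few coefficients copied from the summary above; because the fit did not converge, the resulting numbers should be read as indicative only.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the summary above (log-odds of Not_Canceled)
coefs = pd.Series(
    {
        "lead_time": -0.0163,
        "avg_price_per_room": -0.0201,
        "no_of_special_requests": 1.4550,
        "required_car_parking_space": 1.6213,
    }
)

odds_ratio = np.exp(coefs)  # multiplicative change in the odds per one-unit increase
pct_change = 100 * (odds_ratio - 1)  # the same effect expressed as a percentage

print(pd.DataFrame({"odds_ratio": odds_ratio, "pct_change": pct_change}).round(3))
```

For example, each additional day of lead time multiplies the odds of a booking not being canceled by about 0.98, holding the other predictors fixed.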
# predicting on training set
# default threshold is 0.5, if predicted probability is greater than 0.5 the observation will be classified as 1
pred_train = lg.predict(X_train) > 0.5
pred_train = np.round(pred_train)
cm = confusion_matrix(y_train, pred_train)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
print("Accuracy on training set : ", accuracy_score(y_train, pred_train))
Accuracy on training set : 0.8035995589161941
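Accuracy alone does not show how well cancellations are caught, and with this target coding cancellations are class 0. The sketch below derives recall, precision, and F1 for class 0 by hand from hypothetical labels (`y_true` and `y_pred` are made up for illustration).

```python
import numpy as np

# Hypothetical labels: 1 = Not_Canceled, 0 = Canceled (same coding as y_train)
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 1, 1])

# Treat cancellation (label 0) as the event to catch
tp = int(np.sum((y_true == 0) & (y_pred == 0)))  # cancellations caught
fn = int(np.sum((y_true == 0) & (y_pred == 1)))  # cancellations missed
fp = int(np.sum((y_true == 1) & (y_pred == 0)))  # kept bookings flagged as canceled
tn = int(np.sum((y_true == 1) & (y_pred == 1)))  # kept bookings correctly identified

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

The scikit-learn functions imported earlier give the same quantities when called with `pos_label=0`, for example `recall_score(y_train, pred_train, pos_label=0)`.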
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
index=X_train.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

no_of_adults                             18.327604
no_of_children                           2.110591
no_of_weekend_nights                     2.003719
no_of_week_nights                        3.776277
required_car_parking_space               1.073702
lead_time                                2.474124
arrival_year                             329.439218
arrival_month                            7.206294
arrival_date                             4.219627
repeated_guest                           1.827912
no_of_previous_cancellations             1.400396
no_of_previous_bookings_not_canceled     1.659869
avg_price_per_room                       18.598052
no_of_special_requests                   2.016933
type_of_meal_plan_Meal Plan 2            1.325659
type_of_meal_plan_Meal Plan 3            1.025416
type_of_meal_plan_Not Selected           1.437891
room_type_reserved_Room_Type 2           1.122473
room_type_reserved_Room_Type 3           1.003472
room_type_reserved_Room_Type 4           1.630708
room_type_reserved_Room_Type 5           1.034848
room_type_reserved_Room_Type 6           2.018977
room_type_reserved_Room_Type 7           1.120129
market_segment_type_Complementary        4.548425
market_segment_type_Corporate            17.887527
market_segment_type_Offline              89.908876
market_segment_type_Online               198.241101
dtype: float64
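`variance_inflation_factor` works on the design matrix exactly as it is passed, and no constant column was added here (`add_constant` is imported but not applied). Without an intercept, the R² behind each VIF is uncentered, which can inflate the VIF of columns that vary little around a large mean, such as `arrival_year`. The sketch below is a NumPy re-implementation of VIF as 1 / (1 - R²), written for illustration and not the statsmodels code; it uses hypothetical, independent columns to show the effect of the intercept.

```python
import numpy as np

def vif(X, add_intercept):
    """VIF per column as 1 / (1 - R^2) from regressing it on the other columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        if add_intercept:
            others = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        if add_intercept:
            total = np.sum((target - target.mean()) ** 2)  # centered total sum of squares
        else:
            total = np.sum(target ** 2)  # uncentered total sum of squares
        out.append(total / np.sum(resid ** 2))  # equals 1 / (1 - R^2)
    return np.array(out)

# Hypothetical, independent columns: a year-like column and a party-size-like column
rng = np.random.default_rng(0)
year = rng.choice([2017.0, 2018.0], size=500, p=[0.2, 0.8])
adults = rng.choice([1.0, 2.0, 3.0], size=500, p=[0.2, 0.7, 0.1])
X = np.column_stack([year, adults])

vif_no_const = vif(X, add_intercept=False)
vif_const = vif(X, add_intercept=True)
print("without intercept:", vif_no_const.round(2))
print("with intercept:   ", vif_const.round(2))
```

Although the two columns are generated independently, their VIFs are large without an intercept and close to 1 with one. Computing the VIFs on `add_constant(X_train)` may therefore change which predictors look collinear.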
X_train1 = X_train.drop("arrival_year", axis=1)
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train1.values, i) for i in range(X_train1.shape[1])],
index=X_train1.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

no_of_adults                             18.257632
no_of_children                           2.110407
no_of_weekend_nights                     1.994877
no_of_week_nights                        3.734755
required_car_parking_space               1.073616
lead_time                                2.473596
arrival_month                            7.103096
arrival_date                             4.176361
repeated_guest                           1.819480
no_of_previous_cancellations             1.400361
no_of_previous_bookings_not_canceled     1.659678
avg_price_per_room                       18.000805
no_of_special_requests                   2.012391
type_of_meal_plan_Meal Plan 2            1.324121
type_of_meal_plan_Meal Plan 3            1.025409
type_of_meal_plan_Not Selected           1.434536
room_type_reserved_Room_Type 2           1.121631
room_type_reserved_Room_Type 3           1.003469
room_type_reserved_Room_Type 4           1.630707
room_type_reserved_Room_Type 5           1.034417
room_type_reserved_Room_Type 6           2.012741
room_type_reserved_Room_Type 7           1.118297
market_segment_type_Complementary        1.318501
market_segment_type_Corporate            2.570731
market_segment_type_Offline              10.198173
market_segment_type_Online               25.137772
dtype: float64
X_train2 = X_train1.drop("market_segment_type_Online", axis=1)
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train2.values, i) for i in range(X_train2.shape[1])],
index=X_train2.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

no_of_adults                             14.099249
no_of_children                           2.108158
no_of_weekend_nights                     1.969233
no_of_week_nights                        3.563059
required_car_parking_space               1.073567
lead_time                                2.456340
arrival_month                            6.528665
arrival_date                             3.902745
repeated_guest                           1.814445
no_of_previous_cancellations             1.399309
no_of_previous_bookings_not_canceled     1.659677
avg_price_per_room                       13.383149
no_of_special_requests                   2.008093
type_of_meal_plan_Meal Plan 2            1.308653
type_of_meal_plan_Meal Plan 3            1.025363
type_of_meal_plan_Not Selected           1.369358
room_type_reserved_Room_Type 2           1.108117
room_type_reserved_Room_Type 3           1.003464
room_type_reserved_Room_Type 4           1.575860
room_type_reserved_Room_Type 5           1.031159
room_type_reserved_Room_Type 6           1.951037
room_type_reserved_Room_Type 7           1.098293
market_segment_type_Complementary        1.224883
market_segment_type_Corporate            1.452694
market_segment_type_Offline              1.986718
dtype: float64
X_train3 = X_train2.drop("no_of_adults", axis=1)
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
index=X_train3.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

no_of_children                           2.073037
no_of_weekend_nights                     1.927577
no_of_week_nights                        3.469732
required_car_parking_space               1.073078
lead_time                                2.387198
arrival_month                            6.392902
arrival_date                             3.788891
repeated_guest                           1.814288
no_of_previous_cancellations             1.397232
no_of_previous_bookings_not_canceled     1.658564
avg_price_per_room                       9.089020
no_of_special_requests                   1.967900
type_of_meal_plan_Meal Plan 2            1.302158
type_of_meal_plan_Meal Plan 3            1.025274
type_of_meal_plan_Not Selected           1.320664
room_type_reserved_Room_Type 2           1.107983
room_type_reserved_Room_Type 3           1.003398
room_type_reserved_Room_Type 4           1.539797
room_type_reserved_Room_Type 5           1.030283
room_type_reserved_Room_Type 6           1.949634
room_type_reserved_Room_Type 7           1.098155
market_segment_type_Complementary        1.188310
market_segment_type_Corporate            1.451947
market_segment_type_Offline              1.925227
dtype: float64
# fitting the model on training set
logit3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit3.fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.442911
Iterations: 35
/Users/sydneystolle/opt/anaconda3/lib/python3.8/site-packages/statsmodels/base/model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
# summary of the logistic regression model after removing the high-VIF predictors
print(lg3.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25368
Method: MLE Df Model: 23
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.3011
Time: 23:29:02 Log-Likelihood: -11246.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children -0.2078 0.056 -3.731 0.000 -0.317 -0.099
no_of_weekend_nights 0.0004 0.019 0.021 0.983 -0.037 0.038
no_of_week_nights 0.0589 0.012 5.067 0.000 0.036 0.082
required_car_parking_space 1.5667 0.135 11.626 0.000 1.303 1.831
lead_time -0.0152 0.000 -61.562 0.000 -0.016 -0.015
arrival_month 0.1080 0.005 19.764 0.000 0.097 0.119
arrival_date 0.0171 0.002 9.505 0.000 0.014 0.021
repeated_guest 2.6492 0.628 4.222 0.000 1.419 3.879
no_of_previous_cancellations -0.2184 0.083 -2.632 0.008 -0.381 -0.056
no_of_previous_bookings_not_canceled 0.1862 0.164 1.134 0.257 -0.136 0.508
avg_price_per_room -0.0060 0.000 -13.074 0.000 -0.007 -0.005
no_of_special_requests 1.4728 0.029 50.491 0.000 1.416 1.530
type_of_meal_plan_Meal Plan 2 -0.3126 0.061 -5.088 0.000 -0.433 -0.192
type_of_meal_plan_Meal Plan 3 -11.7340 135.518 -0.087 0.931 -277.344 253.876
type_of_meal_plan_Not Selected 0.0952 0.049 1.943 0.052 -0.001 0.191
room_type_reserved_Room_Type 2 0.7801 0.126 6.177 0.000 0.533 1.028
room_type_reserved_Room_Type 3 0.0526 1.375 0.038 0.969 -2.643 2.748
room_type_reserved_Room_Type 4 0.0587 0.050 1.172 0.241 -0.039 0.157
room_type_reserved_Room_Type 5 0.2499 0.208 1.201 0.230 -0.158 0.658
room_type_reserved_Room_Type 6 0.1451 0.140 1.033 0.301 -0.130 0.420
room_type_reserved_Room_Type 7 0.0609 0.280 0.217 0.828 -0.489 0.610
market_segment_type_Complementary 24.9686 971.781 0.026 0.980 -1879.688 1929.625
market_segment_type_Corporate 1.4906 0.097 15.370 0.000 1.301 1.681
market_segment_type_Offline 2.2483 0.049 45.830 0.000 2.152 2.344
========================================================================================================
# dropping highest p value first
X_train4 = X_train3.drop("no_of_weekend_nights", axis=1)
X_train4
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 0 | 78.15 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 0 | 85.50 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 23 columns
# fitting the model on training set
logit4 = sm.Logit(y_train, X_train4.astype(float))
lg4 = logit4.fit()
pred_train4 = lg4.predict(X_train4)
pred_train4 = np.round(pred_train4)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train4))
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.442911
Iterations: 35
Accuracy on training set : 0.7930056710775047
/Users/sydneystolle/opt/anaconda3/lib/python3.8/site-packages/statsmodels/base/model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
# summary of the refitted logistic regression model (lg4)
print(lg4.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25369
Method: MLE Df Model: 22
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.3011
Time: 23:29:02 Log-Likelihood: -11246.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children -0.2077 0.056 -3.731 0.000 -0.317 -0.099
no_of_week_nights 0.0590 0.011 5.219 0.000 0.037 0.081
required_car_parking_space 1.5666 0.135 11.626 0.000 1.303 1.831
lead_time -0.0152 0.000 -61.598 0.000 -0.016 -0.015
arrival_month 0.1080 0.005 19.787 0.000 0.097 0.119
arrival_date 0.0171 0.002 9.552 0.000 0.014 0.021
repeated_guest 2.6493 0.627 4.222 0.000 1.419 3.879
no_of_previous_cancellations -0.2184 0.083 -2.632 0.008 -0.381 -0.056
no_of_previous_bookings_not_canceled 0.1862 0.164 1.134 0.257 -0.136 0.508
avg_price_per_room -0.0060 0.000 -13.113 0.000 -0.007 -0.005
no_of_special_requests 1.4728 0.029 50.528 0.000 1.416 1.530
type_of_meal_plan_Meal Plan 2 -0.3126 0.061 -5.089 0.000 -0.433 -0.192
type_of_meal_plan_Meal Plan 3 -21.7026 1.98e+04 -0.001 0.999 -3.88e+04 3.87e+04
type_of_meal_plan_Not Selected 0.0953 0.049 1.943 0.052 -0.001 0.191
room_type_reserved_Room_Type 2 0.7801 0.126 6.179 0.000 0.533 1.028
room_type_reserved_Room_Type 3 0.0526 1.375 0.038 0.969 -2.643 2.748
room_type_reserved_Room_Type 4 0.0588 0.050 1.173 0.241 -0.039 0.157
room_type_reserved_Room_Type 5 0.2499 0.208 1.201 0.230 -0.158 0.658
room_type_reserved_Room_Type 6 0.1450 0.140 1.033 0.301 -0.130 0.420
room_type_reserved_Room_Type 7 0.0609 0.280 0.217 0.828 -0.489 0.610
market_segment_type_Complementary 30.7960 1.98e+04 0.002 0.999 -3.87e+04 3.88e+04
market_segment_type_Corporate 1.4905 0.097 15.397 0.000 1.301 1.680
market_segment_type_Offline 2.2483 0.049 45.837 0.000 2.152 2.344
========================================================================================================
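Two rows in the summary above have standard errors near 1.98e+04 and confidence intervals spanning tens of thousands. Estimates like these often point to (quasi-)complete separation for a rare dummy level, which is one plausible reason this fit reports `converged: False`. The sketch below flags such rows using a few values copied from the summary; the helper name `flag_unstable` and the cutoff of 100 are hypothetical choices for illustration, not part of the original analysis.

```python
# (coefficient, standard error) pairs copied from the lg4 summary above.
lg4_terms = {
    "type_of_meal_plan_Meal Plan 2": (-0.3126, 0.061),
    "type_of_meal_plan_Meal Plan 3": (-21.7026, 1.98e4),
    "room_type_reserved_Room_Type 3": (0.0526, 1.375),
    "market_segment_type_Complementary": (30.7960, 1.98e4),
    "market_segment_type_Offline": (2.2483, 0.049),
}


def flag_unstable(terms, max_std_err=100.0):
    """Return the names whose standard error exceeds max_std_err.

    The cutoff is an arbitrary, hypothetical value: it only needs to
    separate ordinary estimates from ones that have blown up.
    """
    return [name for name, (_, std_err) in terms.items() if std_err > max_std_err]


unstable = flag_unstable(lg4_terms)
print(unstable)
```

Both flagged columns are the ones removed in the next two elimination steps, after which the optimizer terminates successfully.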
# drop the feature with the highest p-value (type_of_meal_plan_Meal Plan 3, p = 0.999)
X_train5 = X_train4.drop("type_of_meal_plan_Meal Plan 3", axis=1)
X_train5
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 0 | 78.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 0 | 85.50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 22 columns
# fitting the model on training set
logit5 = sm.Logit(y_train, X_train5.astype(float))
lg5 = logit5.fit()
pred_train5 = lg5.predict(X_train5)
pred_train5 = np.round(pred_train5)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train5))
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.442992
Iterations: 35
Accuracy on training set : 0.7929269061121613
/Users/sydneystolle/opt/anaconda3/lib/python3.8/site-packages/statsmodels/base/model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
# summary of the refitted logistic regression model (lg5)
print(lg5.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25370
Method: MLE Df Model: 21
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.3010
Time: 23:29:03 Log-Likelihood: -11248.
converged: False LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children -0.2074 0.056 -3.725 0.000 -0.317 -0.098
no_of_week_nights 0.0592 0.011 5.239 0.000 0.037 0.081
required_car_parking_space 1.5669 0.135 11.629 0.000 1.303 1.831
lead_time -0.0152 0.000 -61.590 0.000 -0.016 -0.015
arrival_month 0.1080 0.005 19.796 0.000 0.097 0.119
arrival_date 0.0171 0.002 9.553 0.000 0.014 0.021
repeated_guest 2.6494 0.627 4.223 0.000 1.420 3.879
no_of_previous_cancellations -0.2183 0.083 -2.631 0.009 -0.381 -0.056
no_of_previous_bookings_not_canceled 0.1862 0.164 1.134 0.257 -0.136 0.508
avg_price_per_room -0.0060 0.000 -13.154 0.000 -0.007 -0.005
no_of_special_requests 1.4729 0.029 50.536 0.000 1.416 1.530
type_of_meal_plan_Meal Plan 2 -0.3114 0.061 -5.070 0.000 -0.432 -0.191
type_of_meal_plan_Not Selected 0.0958 0.049 1.954 0.051 -0.000 0.192
room_type_reserved_Room_Type 2 0.7801 0.126 6.179 0.000 0.533 1.028
room_type_reserved_Room_Type 3 0.0537 1.375 0.039 0.969 -2.641 2.749
room_type_reserved_Room_Type 4 0.0597 0.050 1.192 0.233 -0.038 0.158
room_type_reserved_Room_Type 5 0.2510 0.208 1.207 0.228 -0.157 0.659
room_type_reserved_Room_Type 6 0.1464 0.140 1.044 0.297 -0.129 0.421
room_type_reserved_Room_Type 7 0.0630 0.280 0.225 0.822 -0.486 0.613
market_segment_type_Complementary 24.4995 1.61e+04 0.002 0.999 -3.14e+04 3.15e+04
market_segment_type_Corporate 1.4911 0.097 15.404 0.000 1.301 1.681
market_segment_type_Offline 2.2471 0.049 45.827 0.000 2.151 2.343
========================================================================================================
# drop the feature with the highest p-value (market_segment_type_Complementary, p = 0.999)
X_train6 = X_train5.drop("market_segment_type_Complementary", axis=1)
X_train6
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 0 | 78.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 0 | 85.50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 21 columns
# fitting the model on training set
logit6 = sm.Logit(y_train, X_train6.astype(float))
lg6 = logit6.fit()
pred_train6 = lg6.predict(X_train6)
pred_train6 = np.round(pred_train6)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train6))
Optimization terminated successfully.
Current function value: 0.444636
Iterations 11
Accuracy on training set : 0.793241965973535
# summary of the refitted logistic regression model (lg6)
print(lg6.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25371
Method: MLE Df Model: 20
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.2984
Time: 23:29:03 Log-Likelihood: -11290.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children -0.2027 0.055 -3.654 0.000 -0.311 -0.094
no_of_week_nights 0.0611 0.011 5.409 0.000 0.039 0.083
required_car_parking_space 1.5787 0.135 11.722 0.000 1.315 1.843
lead_time -0.0152 0.000 -61.926 0.000 -0.016 -0.015
arrival_month 0.1137 0.005 21.100 0.000 0.103 0.124
arrival_date 0.0183 0.002 10.290 0.000 0.015 0.022
repeated_guest 2.7702 0.622 4.451 0.000 1.550 3.990
no_of_previous_cancellations -0.2283 0.083 -2.757 0.006 -0.391 -0.066
no_of_previous_bookings_not_canceled 0.1917 0.167 1.149 0.250 -0.135 0.519
avg_price_per_room -0.0066 0.000 -14.474 0.000 -0.007 -0.006
no_of_special_requests 1.4725 0.029 50.582 0.000 1.415 1.530
type_of_meal_plan_Meal Plan 2 -0.3005 0.062 -4.886 0.000 -0.421 -0.180
type_of_meal_plan_Not Selected 0.0876 0.049 1.788 0.074 -0.008 0.184
room_type_reserved_Room_Type 2 0.7849 0.126 6.234 0.000 0.538 1.032
room_type_reserved_Room_Type 3 0.3564 1.212 0.294 0.769 -2.020 2.733
room_type_reserved_Room_Type 4 0.0718 0.050 1.435 0.151 -0.026 0.170
room_type_reserved_Room_Type 5 0.3186 0.204 1.558 0.119 -0.082 0.719
room_type_reserved_Room_Type 6 0.1717 0.140 1.226 0.220 -0.103 0.446
room_type_reserved_Room_Type 7 0.1555 0.276 0.564 0.573 -0.385 0.696
market_segment_type_Corporate 1.4771 0.097 15.261 0.000 1.287 1.667
market_segment_type_Offline 2.2393 0.049 45.669 0.000 2.143 2.335
========================================================================================================
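The header figures in the summary above can be cross-checked by hand. The sketch below assumes that the "Current function value" printed by the optimizer is the average negative log-likelihood per observation, and that "Pseudo R-squ." is McFadden's measure, 1 - LL / LL-Null. The inputs are the rounded values printed for lg6, so the results only match to the displayed precision.

```python
# Values copied from the lg6 fit output and summary header above.
n_obs = 25392                # "No. Observations"
mean_neg_loglik = 0.444636   # "Current function value" (assumed: -LL / n_obs)
ll_null = -16091.0           # "LL-Null", as printed (rounded)

# Recover the model log-likelihood from the per-observation objective.
log_likelihood = -n_obs * mean_neg_loglik

# McFadden's pseudo R-squared compares the fitted model with the
# intercept-only (null) model.
pseudo_r2 = 1 - log_likelihood / ll_null

print(round(log_likelihood, 1), round(pseudo_r2, 4))
```

Since the pseudo R-squared barely moves as features are dropped (0.3011 for lg4 against 0.2984 here), the removed columns were contributing very little to the fit.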
# drop the feature with the highest p-value (room_type_reserved_Room_Type 3, p = 0.769)
X_train7 = X_train6.drop("room_type_reserved_Room_Type 3", axis=1)
X_train7
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 0 | 78.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 0 | 85.50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 20 columns
# fitting the model on training set
logit7 = sm.Logit(y_train, X_train7.astype(float))
lg7 = logit7.fit()
pred_train7 = lg7.predict(X_train7)
pred_train7 = np.round(pred_train7)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train7))
Optimization terminated successfully.
Current function value: 0.444638
Iterations 11
Accuracy on training set : 0.793241965973535
# summary of the refitted logistic regression model (lg7)
print(lg7.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25372
Method: MLE Df Model: 19
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.2984
Time: 23:29:03 Log-Likelihood: -11290.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children -0.2027 0.055 -3.654 0.000 -0.311 -0.094
no_of_week_nights 0.0611 0.011 5.407 0.000 0.039 0.083
required_car_parking_space 1.5787 0.135 11.722 0.000 1.315 1.843
lead_time -0.0152 0.000 -61.926 0.000 -0.016 -0.015
arrival_month 0.1137 0.005 21.109 0.000 0.103 0.124
arrival_date 0.0183 0.002 10.292 0.000 0.015 0.022
repeated_guest 2.7701 0.622 4.451 0.000 1.550 3.990
no_of_previous_cancellations -0.2283 0.083 -2.757 0.006 -0.391 -0.066
no_of_previous_bookings_not_canceled 0.1917 0.167 1.149 0.250 -0.135 0.519
avg_price_per_room -0.0066 0.000 -14.477 0.000 -0.007 -0.006
no_of_special_requests 1.4725 0.029 50.581 0.000 1.415 1.530
type_of_meal_plan_Meal Plan 2 -0.3006 0.062 -4.888 0.000 -0.421 -0.180
type_of_meal_plan_Not Selected 0.0877 0.049 1.788 0.074 -0.008 0.184
room_type_reserved_Room_Type 2 0.7849 0.126 6.234 0.000 0.538 1.032
room_type_reserved_Room_Type 4 0.0717 0.050 1.434 0.152 -0.026 0.170
room_type_reserved_Room_Type 5 0.3185 0.204 1.558 0.119 -0.082 0.719
room_type_reserved_Room_Type 6 0.1717 0.140 1.226 0.220 -0.103 0.446
room_type_reserved_Room_Type 7 0.1555 0.276 0.564 0.573 -0.385 0.696
market_segment_type_Corporate 1.4770 0.097 15.260 0.000 1.287 1.667
market_segment_type_Offline 2.2393 0.049 45.668 0.000 2.143 2.335
========================================================================================================
# drop the feature with the highest p-value (room_type_reserved_Room_Type 7, p = 0.573)
X_train8 = X_train7.drop("room_type_reserved_Room_Type 7", axis=1)
X_train8
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 0 | 78.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 0 | 85.50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 19 columns
# fitting the model on training set
logit8 = sm.Logit(y_train, X_train8.astype(float))
lg8 = logit8.fit()
pred_train8 = lg8.predict(X_train8)
pred_train8 = np.round(pred_train8)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train8))
Optimization terminated successfully.
Current function value: 0.444645
Iterations 11
Accuracy on training set : 0.7930844360428482
# summary of the refitted logistic regression model (lg8)
print(lg8.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25373
Method: MLE Df Model: 18
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.2984
Time: 23:29:03 Log-Likelihood: -11290.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children -0.1977 0.055 -3.611 0.000 -0.305 -0.090
no_of_week_nights 0.0610 0.011 5.400 0.000 0.039 0.083
required_car_parking_space 1.5779 0.135 11.717 0.000 1.314 1.842
lead_time -0.0152 0.000 -61.958 0.000 -0.016 -0.015
arrival_month 0.1136 0.005 21.111 0.000 0.103 0.124
arrival_date 0.0183 0.002 10.278 0.000 0.015 0.022
repeated_guest 2.7716 0.622 4.453 0.000 1.552 3.991
no_of_previous_cancellations -0.2285 0.083 -2.760 0.006 -0.391 -0.066
no_of_previous_bookings_not_canceled 0.1919 0.167 1.151 0.250 -0.135 0.519
avg_price_per_room -0.0065 0.000 -14.561 0.000 -0.007 -0.006
no_of_special_requests 1.4724 0.029 50.576 0.000 1.415 1.529
type_of_meal_plan_Meal Plan 2 -0.3019 0.061 -4.912 0.000 -0.422 -0.181
type_of_meal_plan_Not Selected 0.0862 0.049 1.761 0.078 -0.010 0.182
room_type_reserved_Room_Type 2 0.7806 0.126 6.212 0.000 0.534 1.027
room_type_reserved_Room_Type 4 0.0690 0.050 1.387 0.166 -0.029 0.167
room_type_reserved_Room_Type 5 0.3155 0.204 1.544 0.123 -0.085 0.716
room_type_reserved_Room_Type 6 0.1583 0.138 1.147 0.251 -0.112 0.429
market_segment_type_Corporate 1.4764 0.097 15.254 0.000 1.287 1.666
market_segment_type_Offline 2.2387 0.049 45.667 0.000 2.143 2.335
========================================================================================================
# drop the feature with the highest p-value (room_type_reserved_Room_Type 6, p = 0.251)
X_train9 = X_train8.drop("room_type_reserved_Room_Type 6", axis=1)
X_train9
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 0 | 78.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 0 | 85.50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 18 columns
# fitting the model on training set
logit9 = sm.Logit(y_train, X_train9.astype(float))
lg9 = logit9.fit()
pred_train9 = lg9.predict(X_train9)
pred_train9 = np.round(pred_train9)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train9))
Optimization terminated successfully.
Current function value: 0.444670
Iterations 11
Accuracy on training set : 0.7923361688720857
# summary of the refitted logistic regression model (lg9)
print(lg9.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25374
Method: MLE Df Model: 17
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.2983
Time: 23:29:03 Log-Likelihood: -11291.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
no_of_children -0.1596 0.044 -3.663 0.000 -0.245 -0.074
no_of_week_nights 0.0604 0.011 5.358 0.000 0.038 0.083
required_car_parking_space 1.5817 0.135 11.743 0.000 1.318 1.846
lead_time -0.0152 0.000 -61.978 0.000 -0.016 -0.015
arrival_month 0.1132 0.005 21.087 0.000 0.103 0.124
arrival_date 0.0182 0.002 10.232 0.000 0.015 0.022
repeated_guest 2.7727 0.622 4.456 0.000 1.553 3.992
no_of_previous_cancellations -0.2289 0.083 -2.765 0.006 -0.391 -0.067
no_of_previous_bookings_not_canceled 0.1921 0.167 1.152 0.249 -0.135 0.519
avg_price_per_room -0.0064 0.000 -14.636 0.000 -0.007 -0.006
no_of_special_requests 1.4706 0.029 50.593 0.000 1.414 1.528
type_of_meal_plan_Meal Plan 2 -0.3036 0.061 -4.940 0.000 -0.424 -0.183
type_of_meal_plan_Not Selected 0.0830 0.049 1.698 0.089 -0.013 0.179
room_type_reserved_Room_Type 2 0.7536 0.123 6.108 0.000 0.512 0.995
room_type_reserved_Room_Type 4 0.0618 0.049 1.251 0.211 -0.035 0.159
room_type_reserved_Room_Type 5 0.3038 0.204 1.490 0.136 -0.096 0.704
market_segment_type_Corporate 1.4724 0.097 15.225 0.000 1.283 1.662
market_segment_type_Offline 2.2360 0.049 45.673 0.000 2.140 2.332
========================================================================================================
# drop the feature with the highest p-value (no_of_previous_bookings_not_canceled, p = 0.249)
X_train10 = X_train9.drop("no_of_previous_bookings_not_canceled", axis=1)
X_train10
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 78.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 85.50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 17 columns
# fitting the model on training set
logit10 = sm.Logit(y_train, X_train10.astype(float))
lg10 = logit10.fit()
pred_train10 = lg10.predict(X_train10)
pred_train10 = np.round(pred_train10)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train10))
Optimization terminated successfully.
Current function value: 0.444727
Iterations 9
Accuracy on training set : 0.7922574039067423
# summary of new logistic regression model
print(lg10.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25375
Method: MLE Df Model: 16
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.2982
Time: 23:29:03 Log-Likelihood: -11293.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
no_of_children -0.1597 0.044 -3.663 0.000 -0.245 -0.074
no_of_week_nights 0.0605 0.011 5.362 0.000 0.038 0.083
required_car_parking_space 1.5808 0.135 11.735 0.000 1.317 1.845
lead_time -0.0152 0.000 -62.019 0.000 -0.016 -0.015
arrival_month 0.1131 0.005 21.077 0.000 0.103 0.124
arrival_date 0.0182 0.002 10.248 0.000 0.015 0.022
repeated_guest 3.1906 0.556 5.736 0.000 2.100 4.281
no_of_previous_cancellations -0.1933 0.075 -2.564 0.010 -0.341 -0.046
avg_price_per_room -0.0064 0.000 -14.640 0.000 -0.007 -0.006
no_of_special_requests 1.4712 0.029 50.619 0.000 1.414 1.528
type_of_meal_plan_Meal Plan 2 -0.3030 0.061 -4.931 0.000 -0.423 -0.183
type_of_meal_plan_Not Selected 0.0831 0.049 1.701 0.089 -0.013 0.179
room_type_reserved_Room_Type 2 0.7539 0.123 6.110 0.000 0.512 0.996
room_type_reserved_Room_Type 4 0.0619 0.049 1.253 0.210 -0.035 0.159
room_type_reserved_Room_Type 5 0.3043 0.204 1.492 0.136 -0.095 0.704
market_segment_type_Corporate 1.4766 0.097 15.265 0.000 1.287 1.666
market_segment_type_Offline 2.2362 0.049 45.679 0.000 2.140 2.332
==================================================================================================
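Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The dependent variable here is `booking_status_Not_Canceled`, so a ratio above 1 raises the odds that a booking is kept. The sketch below uses a handful of coefficients copied from the lg10 summary above; the helper names `odds_ratio` and `pct_change_in_odds` are hypothetical.

```python
import math

# Coefficients copied from the lg10 summary above (log-odds of NOT canceling).
lg10_coefs = {
    "lead_time": -0.0152,
    "no_of_special_requests": 1.4712,
    "required_car_parking_space": 1.5808,
    "market_segment_type_Offline": 2.2362,
    "no_of_children": -0.1597,
}


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coef)


def pct_change_in_odds(coef):
    """The same effect expressed as a percentage change in the odds."""
    return (math.exp(coef) - 1) * 100


for name, coef in lg10_coefs.items():
    print(f"{name}: odds ratio {odds_ratio(coef):.3f}, "
          f"change in odds {pct_change_in_odds(coef):+.1f}%")
```

Read this way, each extra day of lead time lowers the odds of a booking being kept by roughly 1.5%, holding the other features fixed, while each additional special request multiplies those odds by more than four.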
# drop the feature with the highest p-value (room_type_reserved_Room_Type 4, p = 0.210)
X_train11 = X_train10.drop("room_type_reserved_Room_Type 4", axis=1)
X_train11
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 5 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 78.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 85.50 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 16 columns
# fitting the model on training set
logit11 = sm.Logit(y_train, X_train11.astype(float))
lg11 = logit11.fit()
pred_train11 = lg11.predict(X_train11)
pred_train11 = np.round(pred_train11)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train11))
Optimization terminated successfully.
Current function value: 0.444758
Iterations 9
Accuracy on training set : 0.7925330812854442
# summary of new logistic regression model
print(lg11.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25376
Method: MLE Df Model: 15
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.2982
Time: 23:29:03 Log-Likelihood: -11293.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
no_of_children -0.1727 0.042 -4.083 0.000 -0.256 -0.090
no_of_week_nights 0.0616 0.011 5.480 0.000 0.040 0.084
required_car_parking_space 1.5794 0.135 11.723 0.000 1.315 1.843
lead_time -0.0153 0.000 -62.122 0.000 -0.016 -0.015
arrival_month 0.1128 0.005 21.051 0.000 0.102 0.123
arrival_date 0.0182 0.002 10.252 0.000 0.015 0.022
repeated_guest 3.1921 0.555 5.748 0.000 2.104 4.280
no_of_previous_cancellations -0.1937 0.075 -2.571 0.010 -0.341 -0.046
avg_price_per_room -0.0062 0.000 -15.127 0.000 -0.007 -0.005
no_of_special_requests 1.4716 0.029 50.636 0.000 1.415 1.529
type_of_meal_plan_Meal Plan 2 -0.3078 0.061 -5.019 0.000 -0.428 -0.188
type_of_meal_plan_Not Selected 0.0669 0.047 1.419 0.156 -0.025 0.159
room_type_reserved_Room_Type 2 0.7470 0.123 6.061 0.000 0.505 0.989
room_type_reserved_Room_Type 5 0.2899 0.204 1.424 0.154 -0.109 0.689
market_segment_type_Corporate 1.4642 0.096 15.220 0.000 1.276 1.653
market_segment_type_Offline 2.2242 0.048 46.367 0.000 2.130 2.318
==================================================================================================
# dropping highest p value first
X_train12 = X_train11.drop("type_of_meal_plan_Not Selected", axis=1)
X_train12
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 5 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 78.15 | 1 | 0 | 1 | 0 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 85.50 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 | 0 |
25392 rows × 15 columns
# fitting the model on training set
logit12 = sm.Logit(y_train, X_train12.astype(float))
lg12 = logit12.fit()
pred_train12 = lg12.predict(X_train12)
pred_train12 = np.round(pred_train12)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train12))
Optimization terminated successfully.
Current function value: 0.444798
Iterations 9
Accuracy on training set : 0.7923755513547575
# summary of new logistic regression model
print(lg12.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25377
Method: MLE Df Model: 14
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.2981
Time: 23:29:03 Log-Likelihood: -11294.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
no_of_children -0.1792 0.042 -4.259 0.000 -0.262 -0.097
no_of_week_nights 0.0617 0.011 5.485 0.000 0.040 0.084
required_car_parking_space 1.5790 0.135 11.715 0.000 1.315 1.843
lead_time -0.0153 0.000 -62.227 0.000 -0.016 -0.015
arrival_month 0.1141 0.005 21.566 0.000 0.104 0.124
arrival_date 0.0185 0.002 10.514 0.000 0.015 0.022
repeated_guest 3.1874 0.555 5.744 0.000 2.100 4.275
no_of_previous_cancellations -0.1929 0.075 -2.563 0.010 -0.340 -0.045
avg_price_per_room -0.0062 0.000 -15.099 0.000 -0.007 -0.005
no_of_special_requests 1.4709 0.029 50.628 0.000 1.414 1.528
type_of_meal_plan_Meal Plan 2 -0.3131 0.061 -5.115 0.000 -0.433 -0.193
room_type_reserved_Room_Type 2 0.7426 0.123 6.027 0.000 0.501 0.984
room_type_reserved_Room_Type 5 0.2842 0.203 1.397 0.162 -0.115 0.683
market_segment_type_Corporate 1.4518 0.096 15.151 0.000 1.264 1.640
market_segment_type_Offline 2.2139 0.047 46.716 0.000 2.121 2.307
==================================================================================================
# dropping highest p value first
X_train13 = X_train12.drop("room_type_reserved_Room_Type 5", axis=1)
X_train13
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | room_type_reserved_Room_Type 2 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13662 | 0 | 1 | 0 | 163 | 10 | 15 | 0 | 0 | 115.00 | 0 | 0 | 0 | 0 | 1 |
| 26641 | 0 | 3 | 0 | 113 | 3 | 31 | 0 | 0 | 78.15 | 1 | 0 | 1 | 0 | 0 |
| 17835 | 0 | 3 | 0 | 359 | 10 | 14 | 0 | 0 | 78.00 | 1 | 0 | 0 | 0 | 1 |
| 21485 | 0 | 3 | 0 | 136 | 6 | 29 | 0 | 0 | 85.50 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 0 | 2 | 0 | 21 | 8 | 15 | 0 | 0 | 151.00 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7813 | 0 | 1 | 0 | 66 | 11 | 12 | 0 | 0 | 105.33 | 1 | 0 | 0 | 0 | 0 |
| 32511 | 0 | 2 | 0 | 70 | 4 | 22 | 0 | 0 | 105.30 | 1 | 0 | 0 | 0 | 0 |
| 5192 | 0 | 2 | 0 | 24 | 6 | 6 | 0 | 0 | 120.00 | 0 | 0 | 0 | 0 | 0 |
| 12172 | 2 | 1 | 0 | 3 | 3 | 21 | 0 | 0 | 181.00 | 0 | 0 | 0 | 0 | 0 |
| 33003 | 0 | 3 | 0 | 222 | 8 | 31 | 0 | 0 | 96.30 | 1 | 0 | 0 | 0 | 0 |
25392 rows × 14 columns
# fitting the model on training set
logit13 = sm.Logit(y_train, X_train13.astype(float))
lg13 = logit13.fit()
pred_train13 = lg13.predict(X_train13)
pred_train13 = np.round(pred_train13)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train13))
Optimization terminated successfully.
Current function value: 0.444837
Iterations 9
Accuracy on training set : 0.7924149338374291
# summary of new logistic regression model
print(lg13.summary())
Logit Regression Results
=======================================================================================
Dep. Variable: booking_status_Not_Canceled No. Observations: 25392
Model: Logit Df Residuals: 25378
Method: MLE Df Model: 13
Date: Fri, 25 Feb 2022 Pseudo R-squ.: 0.2981
Time: 23:29:03 Log-Likelihood: -11295.
converged: True LL-Null: -16091.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
no_of_children -0.1789 0.042 -4.254 0.000 -0.261 -0.096
no_of_week_nights 0.0616 0.011 5.483 0.000 0.040 0.084
required_car_parking_space 1.5774 0.135 11.706 0.000 1.313 1.842
lead_time -0.0153 0.000 -62.227 0.000 -0.016 -0.015
arrival_month 0.1141 0.005 21.571 0.000 0.104 0.124
arrival_date 0.0185 0.002 10.492 0.000 0.015 0.022
repeated_guest 3.1857 0.555 5.743 0.000 2.098 4.273
no_of_previous_cancellations -0.1929 0.075 -2.564 0.010 -0.340 -0.045
avg_price_per_room -0.0062 0.000 -15.050 0.000 -0.007 -0.005
no_of_special_requests 1.4692 0.029 50.635 0.000 1.412 1.526
type_of_meal_plan_Meal Plan 2 -0.3139 0.061 -5.128 0.000 -0.434 -0.194
room_type_reserved_Room_Type 2 0.7413 0.123 6.018 0.000 0.500 0.983
market_segment_type_Corporate 1.4633 0.096 15.283 0.000 1.276 1.651
market_segment_type_Offline 2.2138 0.047 46.727 0.000 2.121 2.307
==================================================================================================
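The rounds above repeat one pattern: fit, find the variable with the highest p-value, drop it, and refit. A minimal sketch of that loop follows. The function name `backward_eliminate`, the 0.05 cutoff, and the `replay_pvalues` stub are assumptions made for illustration. The stub only replays p-values read from the `lg11` and `lg12` summaries (values printed as 0.000 are treated as 0.0) so that the example runs without refitting; in the notebook, the callback would instead refit `sm.Logit` on the remaining columns and return the fitted model's p-values.

```python
def backward_eliminate(columns, pvalues_fn, threshold=0.05):
    """Drop the highest-p-value column until every p-value is <= threshold."""
    kept = list(columns)
    while kept:
        pvalues = pvalues_fn(kept)                      # "refit" on the remaining columns
        worst = max(kept, key=lambda col: pvalues[col])
        if pvalues[worst] <= threshold:                 # everything left is significant
            break
        kept.remove(worst)                              # drop one variable per round
    return kept


def replay_pvalues(cols):
    """Hypothetical stand-in for a refit: replays p-values from the summaries above."""
    pvalues = {col: 0.0 for col in cols}
    if "no_of_previous_cancellations" in pvalues:
        pvalues["no_of_previous_cancellations"] = 0.010
    if "type_of_meal_plan_Not Selected" in pvalues:
        pvalues["type_of_meal_plan_Not Selected"] = 0.156
    if "room_type_reserved_Room_Type 5" in pvalues:
        # 0.154 while "Not Selected" is still in the model (lg11), 0.162 afterwards (lg12)
        has_not_selected = "type_of_meal_plan_Not Selected" in pvalues
        pvalues["room_type_reserved_Room_Type 5"] = 0.154 if has_not_selected else 0.162
    return pvalues


# A shortened column list, chosen only to keep the example small
start_columns = [
    "lead_time",
    "no_of_previous_cancellations",
    "type_of_meal_plan_Not Selected",
    "room_type_reserved_Room_Type 5",
    "market_segment_type_Offline",
]
final_columns = backward_eliminate(start_columns, replay_pvalues)
print(final_columns)
```

Dropping one variable per round, rather than all insignificant variables at once, matters because p-values shift after each refit, as the change from 0.154 to 0.162 for Room_Type 5 shows.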
# converting coefficients to odds
odds = np.exp(lg13.params)
# adding the odds to a dataframe
pd.DataFrame(odds, X_train13.columns, columns=["odds"]).T
| | no_of_children | no_of_week_nights | required_car_parking_space | lead_time | arrival_month | arrival_date | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | room_type_reserved_Room_Type 2 | market_segment_type_Corporate | market_segment_type_Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| odds | 0.836176 | 1.063582 | 4.84254 | 0.984844 | 1.12084 | 1.018663 | 24.184352 | 0.82458 | 0.993832 | 4.345588 | 0.730606 | 2.098681 | 4.320259 | 9.150489 |
Our y variable is booking_status_Not_Canceled, which equals 1 when the customer does not cancel.
For repeated guests, the odds of not cancelling are about 24.18 times the odds for first-time guests, holding the other variables constant.
# finding the percentage change
perc_change_odds = (np.exp(lg13.params) - 1) * 100
# adding the change_odds% to a dataframe
pd.DataFrame(perc_change_odds, X_train13.columns, columns=["change_odds%"])
| | change_odds% |
|---|---|
| no_of_children | -16.382441 |
| no_of_week_nights | 6.358237 |
| required_car_parking_space | 384.254014 |
| lead_time | -1.515592 |
| arrival_month | 12.084044 |
| arrival_date | 1.866251 |
| repeated_guest | 2318.435236 |
| no_of_previous_cancellations | -17.541977 |
| avg_price_per_room | -0.616831 |
| no_of_special_requests | 334.558754 |
| type_of_meal_plan_Meal Plan 2 | -26.939407 |
| room_type_reserved_Room_Type 2 | 109.868086 |
| market_segment_type_Corporate | 332.025914 |
| market_segment_type_Offline | 815.048879 |
Another way of viewing this is as a percentage change in odds. For example, customers who book through the offline market segment have 815.05% higher odds of not cancelling than customers in the reference segments, holding the other variables constant, so these customers are much less likely to cancel.
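An odds ratio is not a probability, which is easy to lose sight of with numbers this large. The sketch below recomputes the repeated_guest figures from the rounded coefficient in the `lg13` summary and then applies the odds ratio to a baseline probability. The helper `shift_probability` and the 50% baseline are hypothetical, chosen only to show the conversion.

```python
import math

coef_repeated_guest = 3.1857                   # rounded coefficient from the lg13 summary
odds_ratio = math.exp(coef_repeated_guest)     # multiplicative change in the odds
pct_change = (odds_ratio - 1) * 100            # same quantity as the change_odds% column


def shift_probability(p_base, ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    base_odds = p_base / (1 - p_base)
    new_odds = base_odds * ratio
    return new_odds / (1 + new_odds)


# Hypothetical baseline: a booking with a 50% chance of not being cancelled
p_new = shift_probability(0.50, odds_ratio)
print(round(odds_ratio, 2), round(pct_change, 1), round(p_new, 3))
```

Multiplying the odds by roughly 24 moves a 50% baseline to about 96%, not to "2400%"; the size of the shift in probability always depends on where the baseline starts.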
cm = confusion_matrix(y_train, pred_train13)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
We reduced the noise in the model by dropping the statistically insignificant variables (those with high p-values). The model's accuracy also improves compared with the original confusion matrix.
# evaluate the accuracy
print("Accuracy on training set : ", accuracy_score(y_train, pred_train13))
Accuracy on training set : 0.7924149338374291
logit_roc_auc_train = roc_auc_score(y_train, lg13.predict(X_train13))
fpr, tpr, thresholds = roc_curve(y_train, lg13.predict(X_train13))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
The model is performing well on the training set, given the area under the curve as well as the shape.
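For readers who want a concrete meaning for the area under the curve: it equals the share of (positive, negative) pairs in which the positive example gets the higher predicted score. The sketch below computes it that way on four made-up points; `pairwise_auc` and the toy data are illustrative and are not taken from the notebook.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to the red diagonal in the plot (random ranking), and 1.0 means every positive is scored above every negative.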
# dropping variables from test set as well which were dropped from training set
X_test = X_test.drop(["arrival_year", "market_segment_type_Online", "no_of_adults", "no_of_weekend_nights", "type_of_meal_plan_Meal Plan 3", "market_segment_type_Complementary", "room_type_reserved_Room_Type 3", "room_type_reserved_Room_Type 7", "room_type_reserved_Room_Type 6", "no_of_previous_bookings_not_canceled", "room_type_reserved_Room_Type 4", "type_of_meal_plan_Not Selected", "room_type_reserved_Room_Type 5"], axis=1)
# classify a booking as "not cancelled" (1) when the predicted probability exceeds 0.5
pred_test = (lg13.predict(X_test) > 0.5).astype(int)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train13))
print("Accuracy on test set : ", accuracy_score(y_test, pred_test))
Accuracy on training set : 0.7924149338374291
Accuracy on test set : 0.7974823118625379
The training and test accuracies are close in value (about 0.79 and 0.80), so the logistic regression model classifies almost 80% of bookings correctly on both sets and does not appear to be overfitting.
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape)
(25392, 27) (10883, 27)
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
print("Accuracy on training set : ",dTree.score(X_train, y_train))
print("Accuracy on test set : ",dTree.score(X_test, y_test))
Accuracy on training set : 0.994210775047259
Accuracy on test set : 0.8696131581365433
#Checking number of bookings that were not cancelled
y.sum(axis = 0)
24390
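Accuracy is not a good indicator here because the classes are imbalanced: 24,390 of the 36,275 bookings (about 67%) were not cancelled.
Since we don't want people to cancel their bookings, we use recall as the model evaluation metric instead of accuracy. Note that class 1 is "not cancelled", so recall here measures the share of non-cancelled bookings that the model identifies correctly.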
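To make the imbalance concrete, the sketch below scores a baseline that predicts "not cancelled" for every booking, using only the counts reported above (the split shapes and `y.sum()`). The baseline itself is a hypothetical reference point, not one of the notebook's models.

```python
n_total = 25392 + 10883          # training rows + test rows reported by the split
n_not_canceled = 24390           # y.sum() from the cell above (class 1)

# Baseline: predict class 1 ("not cancelled") for every booking
true_positives = n_not_canceled
false_negatives = 0
false_positives = n_total - n_not_canceled
true_negatives = 0

baseline_accuracy = (true_positives + true_negatives) / n_total
baseline_recall = true_positives / (true_positives + false_negatives)
print(round(baseline_accuracy, 4), baseline_recall)
```

The baseline reaches about 67% accuracy without learning anything, which is why accuracy alone is misleading. It also reaches a recall of 1.0, so recall on class 1 is best read alongside accuracy or the confusion matrix rather than in isolation.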
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[1, 0]):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    labels : unused, kept so that existing calls keep working
    '''
    # predictions are always made on the global X_test
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    # the rest of the code pretties up the graph
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    # each cell shows the count and its share of all test rows
    cell_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    cell_labels = np.asarray(cell_labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=cell_labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
## Function to calculate recall score
def get_recall_score(model):
    '''
    model : classifier to predict values of X
    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ", metrics.recall_score(y_train, pred_train))
    print("Recall on test set : ", metrics.recall_score(y_test, pred_test))
make_confusion_matrix(dTree,y_test)
Of the 10,883 test bookings, the model correctly predicted 2,864 cancellations and 6,600 bookings that were not cancelled.
# Recall on train and test
get_recall_score(dTree)
Recall on training set : 0.9957719184919842
Recall on test set : 0.8966173074310556
feature_names = list(X.columns)
print(feature_names)
['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'arrival_date', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Meal Plan 3', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 3', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Complementary', 'market_segment_type_Corporate', 'market_segment_type_Offline', 'market_segment_type_Online']
plt.figure(figsize=(20,30))
tree.plot_tree(dTree,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# calculating the importance of features
print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                           Imp
lead_time                             0.350353
avg_price_per_room                    0.177265
market_segment_type_Online            0.092345
arrival_date                          0.085259
no_of_special_requests                0.067948
arrival_month                         0.064570
no_of_week_nights                     0.045395
no_of_weekend_nights                  0.038528
no_of_adults                          0.027482
arrival_year                          0.011793
type_of_meal_plan_Not Selected        0.008234
room_type_reserved_Room_Type 4        0.006904
required_car_parking_space            0.006878
market_segment_type_Offline           0.003927
type_of_meal_plan_Meal Plan 2         0.003704
no_of_children                        0.003672
room_type_reserved_Room_Type 5        0.001681
room_type_reserved_Room_Type 2        0.001486
market_segment_type_Corporate         0.000646
repeated_guest                        0.000601
room_type_reserved_Room_Type 6        0.000582
room_type_reserved_Room_Type 7        0.000566
no_of_previous_bookings_not_canceled  0.000091
no_of_previous_cancellations          0.000091
type_of_meal_plan_Meal Plan 3         0.000000
room_type_reserved_Room_Type 3        0.000000
market_segment_type_Complementary     0.000000
# taking same tree and plotting it
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
dTree1.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=1)
make_confusion_matrix(dTree1, y_test)
# Accuracy on train and test
print("Accuracy on training set : ",dTree1.score(X_train, y_train))
print("Accuracy on test set : ",dTree1.score(X_test, y_test))
# Recall on train and test
get_recall_score(dTree1)
Accuracy on training set : 0.7844202898550725
Accuracy on test set : 0.7913259211614444
Recall on training set : 0.8103822890363498
Recall on test set : 0.8166010052981931
Recall on training set has reduced from 1 to 0.81 but this is an improvement because now the model is not overfitting and we have a generalized model.
# tree limited to a maximum depth of 3
plt.figure(figsize=(15,10))
tree.plot_tree(dTree1,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dTree1,feature_names=feature_names,show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- weights: [781.00, 4614.00] class: 1
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- weights: [2768.00, 2504.00] class: 0
|   |--- no_of_special_requests > 0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- weights: [1055.00, 5624.00] class: 1
|   |   |--- no_of_special_requests > 1.50
|   |   |   |--- weights: [145.00, 2919.00] class: 1
|--- lead_time > 151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- weights: [1242.00, 694.00] class: 0
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- weights: [249.00, 586.00] class: 1
|   |--- avg_price_per_room > 100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- weights: [2108.00, 31.00] class: 0
|   |   |--- arrival_month > 11.50
|   |   |   |--- weights: [15.00, 57.00] class: 1
print (pd.DataFrame(dTree1.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                           Imp
lead_time                             0.501907
market_segment_type_Online            0.199054
no_of_special_requests                0.164194
avg_price_per_room                    0.113125
arrival_month                         0.021719
no_of_week_nights                     0.000000
type_of_meal_plan_Not Selected        0.000000
market_segment_type_Offline           0.000000
market_segment_type_Corporate         0.000000
market_segment_type_Complementary     0.000000
room_type_reserved_Room_Type 7        0.000000
room_type_reserved_Room_Type 6        0.000000
room_type_reserved_Room_Type 5        0.000000
room_type_reserved_Room_Type 4        0.000000
room_type_reserved_Room_Type 3        0.000000
room_type_reserved_Room_Type 2        0.000000
type_of_meal_plan_Meal Plan 3         0.000000
required_car_parking_space            0.000000
type_of_meal_plan_Meal Plan 2         0.000000
no_of_children                        0.000000
no_of_previous_bookings_not_canceled  0.000000
no_of_previous_cancellations          0.000000
repeated_guest                        0.000000
arrival_date                          0.000000
arrival_year                          0.000000
no_of_weekend_nights                  0.000000
no_of_adults                          0.000000
# only a few features have non-zero importance because the tree depth is limited to 3
importances = dTree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
In the feature importances of the previous (unpruned) model, avg_price_per_room ranked above market_segment_type_Online; in this depth-limited model the order is reversed. This is a shortcoming of pre-pruning: we limit the tree before knowing how important the features and splits are.
That is why we next tune the pre-pruning hyperparameters with a grid search; simply setting max_depth to 3 may not be good enough.
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of hyperparameter values to choose from.
# Every combination of these values is tried to find the best fit,
# so the grid must be chosen carefully.
parameters = {'max_depth': np.arange(1, 10),
              'min_samples_leaf': [1, 2, 5, 7, 10, 15, 20],
              'max_leaf_nodes': [2, 3, 5, 10],
              'min_impurity_decrease': [0.001, 0.01, 0.1]
              }
# Type of scoring used to compare parameter combinations (recall on class 1)
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search with 5-fold cross-validation
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the estimator to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best estimator on the full training set
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2, min_impurity_decrease=0.1,
random_state=1)
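To give a sense of how much work this search does, the sketch below counts the combinations in the grid shown above and the number of model fits implied by `cv=5`. It only enumerates the grid with `itertools.product`; it does not run scikit-learn, and the variable names are illustrative.

```python
from itertools import product

grid = {
    "max_depth": list(range(1, 10)),                 # same values as np.arange(1, 10)
    "min_samples_leaf": [1, 2, 5, 7, 10, 15, 20],
    "max_leaf_nodes": [2, 3, 5, 10],
    "min_impurity_decrease": [0.001, 0.01, 0.1],
}

combinations = list(product(*grid.values()))        # every parameter combination
n_cv_fits = len(combinations) * 5                    # one fit per fold per combination
print(len(combinations), n_cv_fits)
```

Many of these combinations describe the same tree (for example, `max_leaf_nodes=2` allows at most one split whatever `max_depth` is), so a smaller grid would explore the same models at lower cost.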
make_confusion_matrix(estimator,y_test)
# Accuracy on train and test
print("Accuracy on training set : ",estimator.score(X_train, y_train))
print("Accuracy on test set : ",estimator.score(X_test, y_test))
# Recall on train and test
get_recall_score(estimator)
Accuracy on training set : 0.6706442974165091
Accuracy on test set : 0.6763759992649085
Recall on training set : 1.0
Recall on test set : 1.0
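Test accuracy is slightly higher than training accuracy, but the recall of 1.0 on both sets is a warning sign rather than a success: the tuned tree predicts class 1 (not cancelled) for every booking, so it never misses a non-cancellation and never detects a cancellation.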
plt.figure(figsize=(15,10))
tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator,feature_names=feature_names,show_weights=True))
|--- weights: [8363.00, 17029.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                      Imp
no_of_adults                          0.0
type_of_meal_plan_Meal Plan 2         0.0
market_segment_type_Offline           0.0
market_segment_type_Corporate         0.0
market_segment_type_Complementary     0.0
room_type_reserved_Room_Type 7        0.0
room_type_reserved_Room_Type 6        0.0
room_type_reserved_Room_Type 5        0.0
room_type_reserved_Room_Type 4        0.0
room_type_reserved_Room_Type 3        0.0
room_type_reserved_Room_Type 2        0.0
type_of_meal_plan_Not Selected        0.0
type_of_meal_plan_Meal Plan 3         0.0
no_of_special_requests                0.0
no_of_children                        0.0
avg_price_per_room                    0.0
no_of_previous_bookings_not_canceled  0.0
no_of_previous_cancellations          0.0
repeated_guest                        0.0
arrival_date                          0.0
arrival_month                         0.0
arrival_year                          0.0
lead_time                             0.0
required_car_parking_space            0.0
no_of_week_nights                     0.0
no_of_weekend_nights                  0.0
market_segment_type_Online            0.0
Here we see that all feature importances have dropped to zero: the tree is a single leaf with no splits, so this model is severely underfit.
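One way to see why no split survives is to compute the Gini impurity decrease of the first split by hand and compare it with the selected `min_impurity_decrease=0.1`. The sketch below does that from the class weights printed earlier. It assumes the root split would be `lead_time <= 151.50`, as in the depth-3 tree, and rebuilds the two child nodes by summing that tree's leaf weights; the helper `gini` is written here for illustration.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


root = [8363, 17029]   # class weights of the single leaf printed above

# Children of lead_time <= 151.50, summed from the leaf weights of the depth-3 tree
left = [781 + 2768 + 1055 + 145, 4614 + 2504 + 5624 + 2919]
right = [1242 + 249 + 2108 + 15, 694 + 586 + 31 + 57]

n_root, n_left, n_right = sum(root), sum(left), sum(right)
weighted_child_gini = (n_left * gini(left) + n_right * gini(right)) / n_root
impurity_decrease = gini(root) - weighted_child_gini
print(round(gini(root), 6), round(weighted_child_gini, 6), round(impurity_decrease, 6))
```

Under these assumptions the first split lowers impurity by roughly 0.077, which is below the 0.1 threshold, so the split is rejected and the tree stays a single leaf. Because that single leaf predicts class 1 for everything, it scores a recall of 1.0, which is exactly what the recall-based grid search rewards.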
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | 0.007572 |
| 1 | 4.327745e-07 | 0.007573 |
| 2 | 4.688391e-07 | 0.007573 |
| 3 | 5.329960e-07 | 0.007574 |
| 4 | 6.133547e-07 | 0.007575 |
| ... | ... | ... |
| 1340 | 6.665684e-03 | 0.286897 |
| 1341 | 1.304480e-02 | 0.299942 |
| 1342 | 1.725993e-02 | 0.317202 |
| 1343 | 2.399048e-02 | 0.365183 |
| 1344 | 7.657789e-02 | 0.441761 |
1345 rows × 2 columns
The pruning path contains 1345 effective alphas, each corresponding to one candidate pruned subtree.
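For reference, the quantities in this table follow the standard definition of minimal cost-complexity pruning, as described in the scikit-learn documentation. Here $R(T)$ is the total sample-weighted impurity of the leaves of tree $T$, $|\widetilde{T}|$ is its number of leaves, and $T_t$ is the subtree rooted at node $t$. The worked line is a sanity check and assumes that row 1343 is the tree with only the root split (two leaves) and row 1344 is the root alone.

```latex
% Cost-complexity measure of a tree T for a given alpha
R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|

% Effective alpha of an internal node t: the alpha at which pruning T_t to a leaf breaks even
\alpha_{\mathrm{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}

% Sanity check with the last two rows of the table (assumption: 2 leaves -> 1 leaf)
\alpha_{\mathrm{eff}}(\mathrm{root}) = \frac{0.441761 - 0.365183}{2 - 1} \approx 0.076578
```

The result agrees with the last `ccp_alphas` entry (7.657789e-02), which is the alpha at which the whole tree collapses to a single node.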
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.07657789477371368
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
As alpha increases, more of the tree is pruned: the number of nodes and the depth both decrease, so the tree becomes simpler.
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
Both curves indicate a reasonable fit over most of the alpha range. Training accuracy starts near 1.00 for the unpruned tree, while test accuracy never reaches that level.
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Test accuracy of best model: ',best_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.00011736595788977046, random_state=1)
Training accuracy of best model: 0.9029615626969124
Test accuracy of best model: 0.883304235964348
# find best in terms of recall
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00012717831407730315, random_state=1)
make_confusion_matrix(best_model,y_test)
# Recall on train and test
get_recall_score(best_model)
Recall on training set : 0.9417464325562276
Recall on test set : 0.9318027441923652
With post-pruning we get the highest test recall (about 0.93) among the trees that actually split the data, and the training and test recall are close (0.94 vs 0.93).
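Because the alpha above was chosen by looking at test recall, the reported test score is somewhat optimistic. One possible alternative, sketched below on synthetic data, is to pick alpha by cross-validated recall on the training set and keep the test set for a final check. Everything here is an assumption for illustration: the `make_classification` data, the variable names, and the choice of 5 folds are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking data (roughly 67% positives, as in the notebook)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.33, 0.67], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Candidate alphas from the pruning path; drop the last one (single-node tree)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))  # guard against tiny negatives

# Mean 5-fold cross-validated recall on the training data for each alpha
cv_recall = [
    cross_val_score(
        DecisionTreeClassifier(random_state=1, ccp_alpha=alpha),
        X_tr, y_tr, cv=5, scoring="recall",
    ).mean()
    for alpha in alphas
]

best_alpha = alphas[int(np.argmax(cv_recall))]
final_tree = DecisionTreeClassifier(random_state=1, ccp_alpha=best_alpha).fit(X_tr, y_tr)
print(best_alpha, final_tree.score(X_te, y_te))
```

With this split of responsibilities, the test set is touched only once, so its recall and accuracy remain an honest estimate of performance on new bookings.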
plt.figure(figsize=(17,15))
tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model,feature_names=feature_names,show_weights=True))
|--- lead_time <= 151.50 | |--- no_of_special_requests <= 0.50 | | |--- market_segment_type_Online <= 0.50 | | | |--- lead_time <= 90.50 | | | | |--- avg_price_per_room <= 201.50 | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | |--- lead_time <= 16.50 | | | | | | | | |--- weights: [43.00, 558.00] class: 1 | | | | | | | |--- lead_time > 16.50 | | | | | | | | |--- avg_price_per_room <= 135.00 | | | | | | | | | |--- weights: [36.00, 162.00] class: 1 | | | | | | | | |--- avg_price_per_room > 135.00 | | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | |--- weights: [0.00, 1609.00] class: 1 | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | |--- lead_time <= 68.50 | | | | | | | |--- no_of_weekend_nights <= 4.50 | | | | | | | | |--- lead_time <= 1.50 | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | |--- weights: [5.00, 95.00] class: 1 | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | |--- arrival_month <= 2.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- arrival_month > 2.50 | | | | | | | | | | | |--- weights: [2.00, 10.00] class: 1 | | | | | | | | |--- lead_time > 1.50 | | | | | | | | | |--- arrival_month <= 9.50 | | | | | | | | | | |--- lead_time <= 59.50 | | | | | | | | | | | |--- weights: [68.00, 616.00] class: 1 | | | | | | | | | | |--- lead_time > 59.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- arrival_month > 9.50 | | | | | | | | | | |--- weights: [16.00, 524.00] class: 1 | | | | | | | |--- no_of_weekend_nights > 4.50 | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | |--- lead_time > 68.50 | | | | | | | |--- avg_price_per_room <= 99.98 | | | | | | | | |--- arrival_month <= 3.50 | | | | | | | | | |--- avg_price_per_room <= 62.50 | | | | | | | | | | |--- weights: [0.00, 21.00] class: 1 | | | | | | | | | |--- 
avg_price_per_room > 62.50 | | | | | | | | | | |--- lead_time <= 77.00 | | | | | | | | | | | |--- weights: [9.00, 1.00] class: 0 | | | | | | | | | | |--- lead_time > 77.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | |--- arrival_month > 3.50 | | | | | | | | | |--- weights: [8.00, 103.00] class: 1 | | | | | | | |--- avg_price_per_room > 99.98 | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | |--- weights: [52.00, 1.00] class: 0 | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | |--- arrival_date <= 23.50 | | | | | | | | | | |--- weights: [5.00, 28.00] class: 1 | | | | | | | | | |--- arrival_date > 23.50 | | | | | | | | | | |--- weights: [24.00, 4.00] class: 0 | | | | |--- avg_price_per_room > 201.50 | | | | | |--- arrival_date <= 28.00 | | | | | | |--- weights: [17.00, 0.00] class: 0 | | | | | |--- arrival_date > 28.00 | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | |--- lead_time > 90.50 | | | | |--- lead_time <= 117.50 | | | | | |--- avg_price_per_room <= 93.58 | | | | | | |--- arrival_date <= 6.50 | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | |--- avg_price_per_room <= 80.38 | | | | | | | | | | |--- weights: [70.00, 3.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | |--- weights: [1.00, 3.00] class: 1 | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | |--- weights: [1.00, 5.00] class: 1 | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | |--- weights: [2.00, 39.00] class: 1 | | | | | | |--- arrival_date > 6.50 | | | | | | | |--- avg_price_per_room <= 66.50 | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | |--- weights: [1.00, 26.00] class: 1 | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | |--- avg_price_per_room <= 58.75 | | | | | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | | | | | |--- avg_price_per_room > 58.75 | | | | | | | | | | |--- 
lead_time <= 97.50 | | | | | | | | | | | |--- weights: [1.00, 3.00] class: 1 | | | | | | | | | | |--- lead_time > 97.50 | | | | | | | | | | | |--- weights: [39.00, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 66.50 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | |--- arrival_date <= 29.50 | | | | | | | | | | |--- weights: [19.00, 180.00] class: 1 | | | | | | | | | |--- arrival_date > 29.50 | | | | | | | | | | |--- lead_time <= 96.00 | | | | | | | | | | | |--- weights: [0.00, 8.00] class: 1 | | | | | | | | | | |--- lead_time > 96.00 | | | | | | | | | | | |--- weights: [7.00, 2.00] class: 0 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | |--- avg_price_per_room <= 82.50 | | | | | | | | | | |--- weights: [7.00, 1.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 82.50 | | | | | | | | | | |--- weights: [2.00, 12.00] class: 1 | | | | | |--- avg_price_per_room > 93.58 | | | | | | |--- arrival_date <= 16.50 | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | |--- weights: [11.00, 36.00] class: 1 | | | | | | | |--- arrival_month > 7.50 | | | | | | | | |--- avg_price_per_room <= 108.50 | | | | | | | | | |--- arrival_date <= 14.50 | | | | | | | | | | |--- weights: [9.00, 10.00] class: 1 | | | | | | | | | |--- arrival_date > 14.50 | | | | | | | | | | |--- weights: [47.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 108.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- weights: [0.00, 42.00] class: 1 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- arrival_date <= 9.50 | | | | | | | | | | | |--- weights: [3.00, 4.00] class: 1 | | | | | | | | | | |--- arrival_date > 9.50 | | | | | | | | | | | |--- weights: [28.00, 1.00] class: 0 | | | | | | |--- arrival_date > 16.50 | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | |--- avg_price_per_room <= 127.39 | | | | | | | | | |--- weights: [83.00, 8.00] class: 0 | | | | | | | | |--- 
avg_price_per_room > 127.39 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- arrival_month > 8.50 | | | | | | | | |--- weights: [7.00, 8.00] class: 1 | | | | |--- lead_time > 117.50 | | | | | |--- no_of_week_nights <= 1.50 | | | | | | |--- arrival_date <= 7.50 | | | | | | | |--- weights: [0.00, 51.00] class: 1 | | | | | | |--- arrival_date > 7.50 | | | | | | | |--- avg_price_per_room <= 93.58 | | | | | | | | |--- avg_price_per_room <= 65.38 | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 65.38 | | | | | | | | | |--- weights: [2.00, 33.00] class: 1 | | | | | | | |--- avg_price_per_room > 93.58 | | | | | | | | |--- arrival_date <= 28.00 | | | | | | | | | |--- weights: [48.00, 20.00] class: 0 | | | | | | | | |--- arrival_date > 28.00 | | | | | | | | | |--- weights: [1.00, 13.00] class: 1 | | | | | |--- no_of_week_nights > 1.50 | | | | | | |--- no_of_adults <= 1.50 | | | | | | | |--- weights: [0.00, 113.00] class: 1 | | | | | | |--- no_of_adults > 1.50 | | | | | | | |--- lead_time <= 125.50 | | | | | | | | |--- avg_price_per_room <= 90.85 | | | | | | | | | |--- avg_price_per_room <= 87.50 | | | | | | | | | | |--- weights: [9.00, 18.00] class: 1 | | | | | | | | | |--- avg_price_per_room > 87.50 | | | | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 90.85 | | | | | | | | | |--- weights: [0.00, 14.00] class: 1 | | | | | | | |--- lead_time > 125.50 | | | | | | | | |--- weights: [13.00, 161.00] class: 1 | | |--- market_segment_type_Online > 0.50 | | | |--- lead_time <= 13.50 | | | | |--- avg_price_per_room <= 202.67 | | | | | |--- lead_time <= 3.50 | | | | | | |--- arrival_month <= 5.50 | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | |--- weights: [34.00, 229.00] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- arrival_date <= 25.50 | | | | | | | | | |--- weights: [4.00, 21.00] class: 1 | | | | | | | | |--- 
arrival_date > 25.50 | | | | | | | | | |--- weights: [14.00, 1.00] class: 0 | | | | | | |--- arrival_month > 5.50 | | | | | | | |--- weights: [28.00, 363.00] class: 1 | | | | | |--- lead_time > 3.50 | | | | | | |--- avg_price_per_room <= 99.38 | | | | | | | |--- avg_price_per_room <= 78.90 | | | | | | | | |--- weights: [3.00, 121.00] class: 1 | | | | | | | |--- avg_price_per_room > 78.90 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | |--- weights: [0.00, 23.00] class: 1 | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | | | | |--- weights: [40.00, 105.00] class: 1 | | | | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | | | | |--- weights: [5.00, 1.00] class: 0 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [0.00, 42.00] class: 1 | | | | | | |--- avg_price_per_room > 99.38 | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | |--- avg_price_per_room <= 119.25 | | | | | | | | | | |--- avg_price_per_room <= 117.25 | | | | | | | | | | | |--- weights: [38.00, 22.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 117.25 | | | | | | | | | | | |--- weights: [2.00, 13.00] class: 1 | | | | | | | | | |--- avg_price_per_room > 119.25 | | | | | | | | | | |--- weights: [102.00, 42.00] class: 0 | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | |--- arrival_month > 8.50 | | | | | | | | |--- lead_time <= 9.50 | | | | | | | | | |--- weights: [7.00, 71.00] class: 1 | | | | | | | | |--- lead_time > 9.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- arrival_date <= 26.00 | | | | | | | | | | | |--- weights: [19.00, 10.00] class: 0 | | | | | | | | | | |--- arrival_date > 26.00 | | | | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | | | |--- 
arrival_month > 11.50 | | | | | | | | | | |--- weights: [0.00, 10.00] class: 1 | | | | |--- avg_price_per_room > 202.67 | | | | | |--- weights: [32.00, 1.00] class: 0 | | | |--- lead_time > 13.50 | | | | |--- avg_price_per_room <= 105.27 | | | | | |--- avg_price_per_room <= 60.07 | | | | | | |--- lead_time <= 84.50 | | | | | | | |--- weights: [5.00, 70.00] class: 1 | | | | | | |--- lead_time > 84.50 | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | |--- arrival_date <= 19.00 | | | | | | | | | |--- weights: [8.00, 1.00] class: 0 | | | | | | | | |--- arrival_date > 19.00 | | | | | | | | | |--- weights: [2.00, 8.00] class: 1 | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | |--- weights: [1.00, 14.00] class: 1 | | | | | |--- avg_price_per_room > 60.07 | | | | | | |--- lead_time <= 25.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | |--- weights: [0.00, 29.00] class: 1 | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | |--- weights: [2.00, 30.00] class: 1 | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | |--- weights: [83.00, 59.00] class: 0 | | | | | | | |--- arrival_month > 11.50 | | | | | | | | |--- weights: [0.00, 54.00] class: 1 | | | | | | |--- lead_time > 25.50 | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | |--- lead_time <= 60.50 | | | | | | | | | | | |--- weights: [3.00, 58.00] class: 1 | | | | | | | | | | |--- lead_time > 60.50 | | | | | | | | | | | |--- weights: [42.00, 35.00] class: 0 | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | | | |--- weights: [0.00, 12.00] class: 1 | | | | | | | | 
|--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | |--- arrival_month <= 5.00 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- arrival_month > 5.00 | | | | | | | | | | |--- weights: [37.00, 1.00] class: 0 | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | |--- weights: [1.00, 6.00] class: 1 | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- weights: [12.00, 13.00] class: 1 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | |--- avg_price_per_room > 105.27 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- arrival_month <= 10.50 | | | | | | | |--- avg_price_per_room <= 195.30 | | | | | | | | |--- lead_time <= 54.50 | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | |--- weights: [4.00, 17.00] class: 1 | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | |--- lead_time <= 33.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- lead_time > 33.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | |--- lead_time > 54.50 | | | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | | | |--- lead_time <= 135.50 | | | | | | | | | | | |--- weights: [545.00, 159.00] class: 0 | | | | | | | | | | |--- lead_time > 135.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_month > 8.50 | | | | | | | | | | |--- weights: [201.00, 27.00] class: 0 | | | | | | | |--- avg_price_per_room > 195.30 | | | | | | | | |--- weights: [98.00, 2.00] class: 0 | | | | | | |--- arrival_month > 10.50 | | | | | | | |--- lead_time <= 22.50 | | | | | | | | |--- no_of_adults <= 1.50 
| | | | | | | | | |--- weights: [4.00, 1.00] class: 0 | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | |--- weights: [0.00, 22.00] class: 1 | | | | | | | |--- lead_time > 22.50 | | | | | | | | |--- avg_price_per_room <= 168.06 | | | | | | | | | |--- avg_price_per_room <= 147.75 | | | | | | | | | | |--- weights: [41.00, 27.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 147.75 | | | | | | | | | | |--- weights: [15.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 168.06 | | | | | | | | | |--- weights: [4.00, 16.00] class: 1 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- weights: [1.00, 39.00] class: 1 | |--- no_of_special_requests > 0.50 | | |--- no_of_special_requests <= 1.50 | | | |--- market_segment_type_Online <= 0.50 | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | |--- weights: [20.00, 1038.00] class: 1 | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | |--- lead_time <= 63.00 | | | | | | |--- weights: [1.00, 21.00] class: 1 | | | | | |--- lead_time > 63.00 | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | |--- market_segment_type_Online > 0.50 | | | | |--- lead_time <= 8.50 | | | | | |--- weights: [71.00, 1015.00] class: 1 | | | | |--- lead_time > 8.50 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- avg_price_per_room <= 127.62 | | | | | | | |--- no_of_weekend_nights <= 2.50 | | | | | | | | |--- lead_time <= 43.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- arrival_month <= 1.50 | | | | | | | | | | | |--- weights: [0.00, 87.00] class: 1 | | | | | | | | | | |--- arrival_month > 1.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- weights: [0.00, 128.00] class: 1 | | | | | | | | |--- lead_time > 43.50 | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | 
| |--- arrival_month > 7.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- arrival_month > 8.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | |--- no_of_weekend_nights > 2.50 | | | | | | | | |--- weights: [22.00, 12.00] class: 0 | | | | | | |--- avg_price_per_room > 127.62 | | | | | | | |--- lead_time <= 142.50 | | | | | | | | |--- arrival_month <= 8.50 | | | | | | | | | |--- arrival_date <= 19.50 | | | | | | | | | | |--- avg_price_per_room <= 177.15 | | | | | | | | | | | |--- weights: [50.00, 250.00] class: 1 | | | | | | | | | | |--- avg_price_per_room > 177.15 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- arrival_date > 19.50 | | | | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | | | | |--- weights: [58.00, 80.00] class: 1 | | | | | | | | | | |--- arrival_date > 27.50 | | | | | | | | | | | |--- weights: [16.00, 61.00] class: 1 | | | | | | | | |--- arrival_month > 8.50 | | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | | | |--- weights: [5.00, 34.00] class: 1 | | | | | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | | |--- lead_time <= 100.50 | | | | | | | | | | | |--- weights: [0.00, 49.00] class: 1 | | | | | | | | | | |--- lead_time > 100.50 | | | | | | | | | | | |--- weights: [6.00, 1.00] class: 0 | | | | | | | |--- lead_time > 142.50 | | | | | | | | |--- avg_price_per_room <= 142.65 | | | | | | | | | |--- weights: [2.00, 6.00] class: 1 | | | | | | | | |--- avg_price_per_room > 142.65 | | | | | | | | | |--- weights: [12.00, 1.00] class: 0 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- weights: [1.00, 180.00] class: 1 | | |--- 
no_of_special_requests > 1.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_week_nights <= 3.50 | | | | | |--- weights: [0.00, 2126.00] class: 1 | | | | |--- no_of_week_nights > 3.50 | | | | | |--- weights: [38.00, 312.00] class: 1 | | | |--- lead_time > 90.50 | | | | |--- avg_price_per_room <= 202.95 | | | | | |--- arrival_month <= 8.50 | | | | | | |--- arrival_year <= 2017.50 | | | | | | | |--- arrival_month <= 7.50 | | | | | | | | |--- weights: [6.00, 2.00] class: 0 | | | | | | | |--- arrival_month > 7.50 | | | | | | | | |--- weights: [2.00, 12.00] class: 1 | | | | | | |--- arrival_year > 2017.50 | | | | | | | |--- lead_time <= 150.50 | | | | | | | | |--- weights: [19.00, 272.00] class: 1 | | | | | | | |--- lead_time > 150.50 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | |--- arrival_month > 8.50 | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | |--- avg_price_per_room <= 90.42 | | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | | |--- arrival_date <= 21.50 | | | | | | | | | | |--- weights: [19.00, 6.00] class: 0 | | | | | | | | | |--- arrival_date > 21.50 | | | | | | | | | | |--- weights: [4.00, 9.00] class: 1 | | | | | | | | |--- arrival_month > 11.50 | | | | | | | | | |--- weights: [5.00, 21.00] class: 1 | | | | | | | |--- avg_price_per_room > 90.42 | | | | | | | | |--- weights: [42.00, 107.00] class: 1 | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | |--- weights: [0.00, 52.00] class: 1 | | | | |--- avg_price_per_room > 202.95 | | | | | |--- weights: [7.00, 0.00] class: 0 |--- lead_time > 151.50 | |--- avg_price_per_room <= 100.04 | | |--- no_of_special_requests <= 0.50 | | | |--- market_segment_type_Online <= 0.50 | | | | |--- no_of_adults <= 1.50 | | | | | |--- lead_time <= 163.50 | | | | | | |--- avg_price_per_room <= 85.50 | | | | | | | |--- weights: [1.00, 5.00] class: 1 | | | | | | |--- avg_price_per_room > 85.50 | | | | | | | |--- weights: [15.00, 0.00] class: 0 | | | | | |--- lead_time > 163.50 | | | 
| | | |--- lead_time <= 341.00 | | | | | | | |--- lead_time <= 173.00 | | | | | | | | |--- arrival_date <= 3.50 | | | | | | | | | |--- weights: [6.00, 63.00] class: 1 | | | | | | | | |--- arrival_date > 3.50 | | | | | | | | | |--- avg_price_per_room <= 70.85 | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | |--- avg_price_per_room > 70.85 | | | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | | | |--- lead_time > 173.00 | | | | | | | | |--- weights: [8.00, 262.00] class: 1 | | | | | | |--- lead_time > 341.00 | | | | | | | |--- no_of_week_nights <= 4.00 | | | | | | | | |--- weights: [10.00, 17.00] class: 1 | | | | | | | |--- no_of_week_nights > 4.00 | | | | | | | | |--- weights: [8.00, 1.00] class: 0 | | | | |--- no_of_adults > 1.50 | | | | | |--- avg_price_per_room <= 84.58 | | | | | | |--- lead_time <= 244.00 | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | |--- lead_time <= 166.50 | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | |--- lead_time > 166.50 | | | | | | | | | | |--- arrival_date <= 19.00 | | | | | | | | | | | |--- weights: [38.00, 1.00] class: 0 | | | | | | | | | | |--- arrival_date > 19.00 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | |--- weights: [0.00, 24.00] class: 1 | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | |--- avg_price_per_room <= 66.50 | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | |--- weights: [7.00, 2.00] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | |--- weights: [2.00, 17.00] class: 1 | | | | | | | | |--- avg_price_per_room > 66.50 | | | | | | | | | |--- weights: [9.00, 123.00] class: 1 | | | | | | |--- lead_time > 244.00 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- arrival_year <= 2017.50 | | | | | | | | | |--- weights: [0.00, 34.00] class: 1 | 
| | | | | | | |--- arrival_year > 2017.50 | | | | | | | | | |--- avg_price_per_room <= 80.38 | | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | |--- weights: [0.00, 11.00] class: 1 | | | | | | | |--- arrival_month > 11.50 | | | | | | | | |--- weights: [0.00, 37.00] class: 1 | | | | | |--- avg_price_per_room > 84.58 | | | | | | |--- arrival_month <= 11.50 | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | |--- weights: [313.00, 10.00] class: 0 | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- arrival_month <= 6.50 | | | | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | | | | | | |--- arrival_month > 6.50 | | | | | | | | | |--- weights: [0.00, 14.00] class: 1 | | | | | | |--- arrival_month > 11.50 | | | | | | | |--- weights: [0.00, 9.00] class: 1 | | | |--- market_segment_type_Online > 0.50 | | | | |--- avg_price_per_room <= 2.50 | | | | | |--- weights: [4.00, 12.00] class: 1 | | | | |--- avg_price_per_room > 2.50 | | | | | |--- weights: [601.00, 5.00] class: 0 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- lead_time <= 180.50 | | | | | |--- weights: [8.00, 60.00] class: 1 | | | | |--- lead_time > 180.50 | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | |--- arrival_month <= 11.50 | | | | | | | | |--- avg_price_per_room <= 33.75 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- avg_price_per_room > 33.75 | | | | | | | | | |--- weights: [125.00, 0.00] class: 0 | | | | | | | |--- arrival_month > 11.50 | | 
| | | | | | |--- weights: [11.00, 8.00] class: 0 | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | |--- weights: [0.00, 12.00] class: 1 | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | |--- weights: [4.00, 17.00] class: 1 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- market_segment_type_Offline <= 0.50 | | | | | |--- no_of_week_nights <= 9.50 | | | | | | |--- arrival_month <= 11.50 | | | | | | | |--- arrival_date <= 27.50 | | | | | | | | |--- weights: [47.00, 269.00] class: 1 | | | | | | | |--- arrival_date > 27.50 | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | |--- lead_time <= 224.50 | | | | | | | | | | |--- weights: [10.00, 1.00] class: 0 | | | | | | | | | |--- lead_time > 224.50 | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | |--- lead_time <= 269.00 | | | | | | | | | | |--- weights: [8.00, 35.00] class: 1 | | | | | | | | | |--- lead_time > 269.00 | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | |--- arrival_month > 11.50 | | | | | | | |--- weights: [21.00, 26.00] class: 1 | | | | | |--- no_of_week_nights > 9.50 | | | | | | |--- weights: [7.00, 1.00] class: 0 | | | | |--- market_segment_type_Offline > 0.50 | | | | | |--- weights: [5.00, 151.00] class: 1 | |--- avg_price_per_room > 100.04 | | |--- arrival_month <= 11.50 | | | |--- no_of_special_requests <= 2.50 | | | | |--- weights: [2108.00, 0.00] class: 0 | | | |--- no_of_special_requests > 2.50 | | | | |--- weights: [0.00, 31.00] class: 1 | | |--- arrival_month > 11.50 | | | |--- no_of_special_requests <= 0.50 | | | | |--- weights: [0.00, 47.00] class: 1 | | | |--- no_of_special_requests > 0.50 | | | | |--- arrival_date <= 24.50 | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | |--- arrival_date > 24.50 | | | | | |--- weights: [15.00, 5.00] class: 0
# Importance of each feature in the best model, sorted from most to least important
print(pd.DataFrame(best_model.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
                                           Imp
lead_time                             0.402838
avg_price_per_room                    0.155707
market_segment_type_Online            0.138645
no_of_special_requests                0.100398
arrival_month                         0.056324
arrival_date                          0.034973
no_of_weekend_nights                  0.031197
no_of_adults                          0.023968
no_of_week_nights                     0.017755
arrival_year                          0.014170
required_car_parking_space            0.010020
market_segment_type_Offline           0.005098
type_of_meal_plan_Not Selected        0.002692
type_of_meal_plan_Meal Plan 2         0.002174
room_type_reserved_Room_Type 4        0.001874
room_type_reserved_Room_Type 5        0.001078
room_type_reserved_Room_Type 2        0.000613
no_of_children                        0.000475
no_of_previous_bookings_not_canceled  0.000000
no_of_previous_cancellations          0.000000
repeated_guest                        0.000000
room_type_reserved_Room_Type 3        0.000000
room_type_reserved_Room_Type 6        0.000000
room_type_reserved_Room_Type 7        0.000000
market_segment_type_Complementary     0.000000
market_segment_type_Corporate         0.000000
type_of_meal_plan_Meal Plan 3         0.000000
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
comparison_frame = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with restricted maximum depth','Decision tree with hyperparameter tuning',
'Decision tree with post-pruning'], 'Train_Recall':[0.99,0.81,1,0.94], 'Test_Recall':[0.89,0.81,1,0.93]})
comparison_frame
| | Model | Train_Recall | Test_Recall |
|---|---|---|---|
| 0 | Initial decision tree model | 0.99 | 0.89 |
| 1 | Decision tree with restricted maximum depth | 0.81 | 0.81 |
| 2 | Decision tree with hyperparameter tuning | 1.00 | 1.00 |
| 3 | Decision tree with post-pruning | 0.94 | 0.93 |
The decision tree with hyperparameter tuning has the highest recall (1.00 on both the training and test sets), but it consists of a single node. A one-node tree assigns the same class to every booking, so its perfect recall says nothing about how well it separates the classes, and it does not represent the data.
Setting that tree aside, the decision tree with post-pruning has the highest test recall (0.93), and its training recall (0.94) is almost identical, so it generalizes well. It is the best of the decision tree models.
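The point about the single-node tree can be checked with a small sketch. It uses synthetic data, not the INN Hotels dataset, and assumes for illustration that the positive class (1) is the majority class. The variable names and the `min_samples_split` trick used to force a one-node tree are choices made for this example only, not the notebook's tuning procedure.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): roughly two thirds of the labels are class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.67).astype(int)

# Requiring more samples to split than exist forces the root to stay a leaf,
# so the fitted tree has exactly one node
one_node_tree = DecisionTreeClassifier(min_samples_split=len(y) + 1, random_state=0)
one_node_tree.fit(X, y)
node_count = one_node_tree.tree_.node_count

# The single leaf predicts the majority class (1 here) for every row
pred = one_node_tree.predict(X)
recall = recall_score(y, pred)
precision = precision_score(y, pred)

print("nodes:", node_count)
print("recall:", recall)
print("precision:", round(precision, 3))
```

Recall is perfect because no positive case is ever missed, while precision simply equals the share of positives in the data. Looking at precision or F1 next to recall makes this kind of degenerate model easy to spot.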